Abstract. This article is about estimation and inference methods for high dimensional sparse
(HDS) regression models in econometrics. High dimensional sparse models arise in situations 
where many regressors (or series terms) are available and the regression function is well- 
approximated by a parsimonious, yet unknown set of regressors. The latter condition makes 
it possible to estimate the entire regression function effectively by searching for approximately 
the right set of regressors. We discuss methods for identifying this set of regressors and
estimating their coefficients based on $\ell_1$-penalization and describe key theoretical results. In order
to capture realistic practical situations, we expressly allow for imperfect selection of regressors 
and study the impact of this imperfect selection on estimation and inference results. We focus 
the main part of the article on the use of HDS models and methods in the instrumental
variables model and the partially linear model. We present a set of novel inference results for these
models and illustrate their use with applications to returns to schooling and growth regression.

1. Introduction

We consider linear, high dimensional sparse (HDS) regression models in econometrics. The
HDS regression model allows for a large number of regressors, $p$, which is possibly much larger
than the sample size, $n$, but imposes that the model is sparse. That is, we assume only
$s \ll n$ of these regressors are important for capturing the main features of the regression
function. This assumption makes it possible to estimate HDS models effectively by searching
for approximately the right set of regressors. In this article, we review estimation methods
for HDS models that make use of $\ell_1$-penalization and then provide a set of novel inference
results. We also provide empirical examples that illustrate the potentially wide applicability of
HDS models and methods in econometrics.

The motivation for considering HDS models comes in part from the wide availability of
data sets with many regressors. For example, the American Housing Survey records prices
as well as a multitude of features of houses sold; and scanner data sets record prices and
numerous characteristics of products sold at a store or on the internet. HDS models are
also partly motivated by the use of series methods in econometrics. Series methods use many
constructed or series regressors - regressors formed as transformations of elementary regressors -
to approximate regression functions. In these applications, it is important to have a parsimonious
yet accurate approximation of the regression function. One way to achieve this is to use the
data to select a small number of informative terms from among a very large set of control
variables or approximating functions. In this article, we formally discuss doing this selection
and estimating the regression function.

We organize the article as follows. In the next section, we introduce the concepts of sparse
and approximately sparse regression models in the canonical context of modeling a conditional
mean function and motivate the use of HDS models via empirical and analytical examples.
In Section 3 we discuss some principal estimation methods and mention extensions of these
methods to applications beyond conditional mean models. We discuss some key estimation
results for HDS methods and mention various extensions of these results in Section 4. We then
develop HDS models and methods in instrumental variables models with many instruments
in Section 5 and a partially linear model with many series terms in Section 6, with the main
emphasis given to inference. Finally, we present two empirical examples which motivate the
use of these methods in Section 7.

Notation. We allow for the models to change with the sample size, i.e. we allow for
array asymptotics. In particular we assume that $p = p_n$ grows to infinity as $n$ grows, and
$s = s_n$ can also grow with $n$, although we require that $s \log p = o(n)$. Thus, all parameters are
implicitly indexed by the sample size $n$, but we omit the index to simplify notation. We also
use the following empirical process notation, $E_n[f] = E_n[f(z_i)] = \sum_{i=1}^n f(z_i)/n$. The $\ell_2$-norm is
denoted by $\|\cdot\|$, and the $\ell_0$-norm, $\|\cdot\|_0$, denotes the number of non-zero components of a vector.
We use $\|\cdot\|_\infty$ to denote the maximal absolute element of a vector. Given a vector $\delta \in \mathbb{R}^p$, and a set of
indices $T \subset \{1,\ldots,p\}$, we denote by $\delta_T \in \mathbb{R}^p$ the vector in which $\delta_{Tj} = \delta_j$ if $j \in T$, $\delta_{Tj} = 0$ if
$j \notin T$. We use the notation $(a)_+ = \max\{a, 0\}$, $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$. We
also use the notation $a \lesssim b$ to denote $a \le cb$ for some constant $c > 0$ that does not depend on
$n$; and $a \lesssim_P b$ to denote $a = O_P(b)$. For an event $E$, we say that $E$ wp $\to 1$ when $E$ occurs
with probability approaching one as $n$ grows.



2. Sparse and Approximately Sparse Regression Models 

In this section we review the modeling foundations for HDS methods and provide motivating
examples with emphasis on applications in econometrics. First, let us consider the following
parametric linear regression model:

$$y_i = x_i'\beta_0 + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad \beta_0 \in \mathbb{R}^p, \quad i = 1,\ldots,n,$$

$$T = \mathrm{support}(\beta_0) \text{ has } s \text{ elements where } s < n,$$

where $p > n$ is allowed, $T$ is unknown, and regressors $X = [x_1,\ldots,x_n]'$ are fixed. We assume
Gaussian errors to simplify the presentation of the main ideas throughout the article, but note 
that this assumption can be eliminated without substantially altering the results. It is clear 
that simply regressing y on all p available x variables is problematic when p is large relative to 
n which motivates consideration of models that impose some regularization on the estimation 
problem. 

The key assumption that allows effective use of this large set of covariates is sparsity of the
model of interest. Sparsity refers to the condition that only $s \ll n$ elements of $\beta_0$ are
non-zero but allows the identities of these elements to be unknown. Sparsity can be motivated on
economic grounds in situations where a researcher believes that the economic outcome could be
well-predicted by a small (relative to the sample size) number of factors but is unsure about the
identity of the relevant factors. Note that we allow $s = s_n$ to grow with $n$, as mentioned in the
notation section, although $s \log p = o(n)$ will be required for consistency. This simple sparse
model substantially generalizes the classical parametric linear model by letting the identities,
$T$, of the relevant regressors be unknown. This generalization is useful in practice since it is
problematic to assume that we know the identities of the relevant regressors in many examples.

The previous model is simple and allows us to convey the essential ideas of the sparsity- 
based approach. However, it is unrealistic in that it presumes exact sparsity or that, after 
accounting for s main regressors, the error in approximating the regression function is zero. 
We shall make no formal use of the previous model, but instead use a much more general, 
approximately sparse or nonparametric model. In this model, all of the regressors potentially 
have a non-zero contribution to the regression function, but no more than s unknown regressors 
are needed for approximating the regression function with a sufficient degree of accuracy. 

We formally define the approximately sparse model as follows. 

Condition ASM. We have data $\{(y_i, z_i), i = 1,\ldots,n\}$ that for each $n$ obey the regression
model:

$$y_i = f(z_i) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad i = 1,\ldots,n, \qquad (2.1)$$

where $y_i$ is the outcome variable, $z_i$ is a $k_z$-vector of elementary regressors, $f(z_i)$ is the
regression function, and $\epsilon_i$ are i.i.d. disturbances. Let $x_i = P(z_i)$, where $P(z_i)$ is a vector of
dimension $p = p_n$ that contains a dictionary of possibly technical transformations of $z_i$,
including a constant. The values $x_1,\ldots,x_n$ are treated as fixed, and normalized so that $E_n[x_{ij}^2] = 1$
for $j = 1,\ldots,p$. The regression function $f(z_i)$ admits the approximately sparse form, namely
there exists $\beta_0$ such that

$$f(z_i) = x_i'\beta_0 + r_i, \quad \|\beta_0\|_0 \le s, \quad c_s := \{E_n[r_i^2]\}^{1/2} \le K\sigma\sqrt{s/n}, \qquad (2.2)$$

where $s = s_n = o(n/\log p)$ and $K$ is a constant independent of $n$.

In the set-up we consider the fixed design case, which covers random sampling as a special
case where $x_1,\ldots,x_n$ represent a realization of this sample on which we condition
throughout. The vector $x_i = P(z_i)$ can include polynomial or spline transformations of the original
regressors $z_i$; see, e.g., Newey (1997) and Chen (2007) for various examples of series terms. The
approximate sparsity can be motivated similarly to Newey (1997), who assumes that the first
$s = s_n$ series terms can approximate the nonparametric regression function well. Condition
ASM is more general in that it does not impose that the most important $s = s_n$ terms in
the approximating dictionary are the first $s$ terms; in fact, the identity of the most important
terms is treated as unknown. We note that in the parametric case, we may naturally choose
$x_i'\beta_0 = f(z_i)$ so that $r_i = 0$ for all $i = 1,\ldots,n$. In the nonparametric case, we may think of
$x_i'\beta_0$ as any sparse parametric model that yields a good approximation to the true regression
function $f(z_i)$ in equation (2.1) so that $r_i$ is "small" relative to the conjectured size of the
estimation error. Given (2.2), our target in estimation is the parametric function $x_i'\beta_0$, where
we can call

$$T := \mathrm{support}(\beta_0)$$

the "true" model. Here we emphasize that the ultimate target in estimation is, of course,
$f(z_i)$. The function $x_i'\beta_0$ is simply a convenient intermediate target introduced so that we
can approach the estimation problem as if it were parametric. Indeed, the two targets, $f(z_i)$
and $x_i'\beta_0$, are equal up to the approximation error $r_i$. Thus, the problem of estimating the
parametric target $x_i'\beta_0$ is equivalent to the problem of estimating the nonparametric target
$f(z_i)$ modulo approximation errors.

One way to explicitly construct a good approximating model $\beta_0$ for (2.2) is by taking $\beta_0$ as
the solution to

$$\min_{\beta \in \mathbb{R}^p} E_n[(f(z_i) - x_i'\beta)^2] + \sigma^2 \frac{\|\beta\|_0}{n}. \qquad (2.3)$$

We can call (2.3) the oracle problem,$^1$ and so we can call $T = \mathrm{support}(\beta_0)$ the oracle model.
Note that we necessarily have that $s = \|\beta_0\|_0 \le n$. The oracle problem (2.3) balances the
approximation error $E_n[(f(z_i) - x_i'\beta)^2]$ over the design points with the variance term $\sigma^2\|\beta\|_0/n$,
where the latter is determined by the number of non-zero coefficients in $\beta$. Letting $c_s^2 :=
E_n[r_i^2] = E_n[(f(z_i) - x_i'\beta_0)^2]$ denote the squared error from approximating values $f(z_i)$ by $x_i'\beta_0$,
the quantity $c_s^2 + \sigma^2 s/n$ is the optimal value of (2.3). In common nonparametric problems,
such as the one described below, the optimal solution in (2.3) would balance the approximation
error with the variance term giving that $c_s \le K\sigma\sqrt{s/n}$. Thus, we would have $\sqrt{c_s^2 + \sigma^2 s/n} \lesssim
\sigma\sqrt{s/n}$, implying that the quantity $\sigma\sqrt{s/n}$ is the ideal goal for the rate of convergence. If we
knew the oracle model $T$, we would achieve this rate by using the oracle estimator, the least
squares estimator based on this model. Of course, we do not generally know $T$ since we do
not observe the $f(z_i)$'s and thus cannot attempt to solve the oracle problem (2.3). Since $T$ is
unknown, we will not generally be able to achieve the exact oracle rates of convergence, but
we can hope to come close to this rate.

1 By definition the oracle knows the risk function of any estimator, so it can compute the best sparse least
squares estimator. Under some mild conditions the problem of minimizing prediction risk amongst all sparse least
squares estimators is equivalent to the problem written here; see, e.g., Belloni and Chernozhukov (2011b).
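The oracle problem (2.3) can be made concrete with a small brute-force sketch. Everything below is illustrative rather than taken from the article: the function name `oracle_model`, the simulated design, and the coefficient values are assumptions, and the exhaustive search over supports is only feasible for very small $p$.

```python
import numpy as np
from itertools import combinations

def oracle_model(X, f, sigma, max_size):
    """Brute-force solution of the oracle problem (2.3) over supports of size <= max_size."""
    n, p = X.shape
    best_value, best_T = float(np.mean(f ** 2)), ()      # empty model: beta = 0
    for k in range(1, max_size + 1):
        for T in combinations(range(p), k):
            cols = X[:, list(T)]
            coef, *_ = np.linalg.lstsq(cols, f, rcond=None)
            approx_err = np.mean((f - cols @ coef) ** 2)  # E_n[(f - x'beta)^2]
            value = approx_err + sigma ** 2 * k / n       # plus the variance term
            if value < best_value:
                best_value, best_T = value, T
    return best_T, best_value

rng = np.random.default_rng(0)
n, p, sigma = 100, 8, 1.0
X = rng.standard_normal((n, p))
# Two large coefficients and one tiny one: f is only approximately sparse.
f = X[:, 2] + X[:, 5] + 0.01 * X[:, 7]

T_oracle, value = oracle_model(X, f, sigma, max_size=4)
print(T_oracle, value)
```

In this hypothetical design the tiny third coefficient is left out of the oracle model: adding it would lower the approximation error by far less than the extra variance cost $\sigma^2/n$.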

Before considering estimation methods, a natural question is whether exact or approximate 
HDS models make sense in econometric applications. In order to answer this question, it is 
helpful to consider the following two examples in which we abstract from estimation completely 
and only ask whether it is possible to accurately describe some structural econometric function 
$f(z)$ using a low-dimensional approximation of the form $P(z)'\beta_0$.

Example 1: Sparse Models for Earning Regressions. In this example we consider a
model for the conditional expectation of log-wage given education $z_i$, measured in years of
schooling. We can expand the conditional expectation of wage $y_i$ given education $z_i$:

$$E[y_i | z_i] = \sum_{j=1}^p \beta_{0j} P_j(z_i), \qquad (2.4)$$

using some dictionary of approximating functions $P(z_i) = (P_1(z_i),\ldots,P_p(z_i))'$, such as
polynomial or spline transformations in $z_i$ and/or indicator variables for levels of $z_i$. In fact,
since we can consider an overcomplete dictionary, the representation of the function using
$P_1(z_i),\ldots,P_p(z_i)$ may not be unique, but this is not important for our purposes.

A conventional sparse approximation employed in econometrics is, for example,

$$f(z_i) := E[y_i | z_i] = \tilde\beta_1 P_1(z_i) + \cdots + \tilde\beta_s P_s(z_i) + \tilde r_i, \qquad (2.5)$$

where the $P_j$'s are low-order polynomials or splines, with typically one or two (linear or linear
and quadratic) terms. Of course, there is no guarantee that the approximation error $\tilde r_i$ in this
case is small or that these particular polynomials form the best possible $s$-dimensional
approximation. Indeed, we might expect the function $E[y_i | z_i]$ to change rapidly near the schooling
levels associated with advanced degrees, such as MBAs or MDs. Low-degree polynomials may
not be able to capture this behavior very well, resulting in large approximation errors $\tilde r_i$.

A sensible question is then, "Can we find a better approximation that uses the same number
of parameters?" More formally, can we construct a much better approximation of the sparse
form

$$f(z_i) := E[y_i | z_i] = \beta_{k_1} P_{k_1}(z_i) + \cdots + \beta_{k_s} P_{k_s}(z_i) + r_i, \qquad (2.6)$$

for some regressor indices $k_1,\ldots,k_s$ selected from $\{1,\ldots,p\}$? Since we can always include (2.5)
as a special case, we can in principle do no worse than the conventional approximation; and, in
fact, we can construct (2.6) that is much better, if there are some important higher-order terms
in (2.4) that are completely missed by the conventional approximation. Thus, the answer to
the question depends strongly on the empirical context.

Consider for example the earnings of prime age white males in the 2000 U.S. Census; see, e.g.,
Angrist, Chernozhukov, and Fernandez-Val (2006). Treating this data as the population data,

Sparse Approximation    $L_1$ error    $L_\infty$ error
Conventional            0.12           0.29
Lasso                   0.08           0.12
Post-Lasso              0.04           0.08

TABLE 1. Errors of Conventional and the Lasso-based Sparse Approximations of the Earning
Function. The Lasso method minimizes the least squares criterion plus the $\ell_1$-norm of the
coefficients scaled by a penalty parameter $\lambda$. The nature of the penalty forces many coefficients
to zero, producing a sparse fit. The Post-Lasso minimizes the least squares criterion over
the non-zero components selected by the Lasso estimator. This example deals with a pure
approximation problem, in which there is no noise.



we can compute $f(z_i) = E[y_i | z_i]$ without error. Figure 1 plots this function. We then construct
two sparse approximations and also plot them in Figure 1. The first is the conventional
approximation of the form (2.5) with $P_1,\ldots,P_s$ representing polynomials of degree zero to
$s - 1$ ($s = 5$ in this example). The second is an approximation of the form (2.6), with $P_{k_1},\ldots,
P_{k_s}$ consisting of a constant, a linear term, and three linear spline terms with knots located
at 16, 17, and 19 years of schooling. We find the latter approximation automatically using
the $\ell_1$-penalization or Lasso methods discussed below,$^2$ although in this special case we could
construct such an approximation just by eye-balling Figure 1 and noting that most of the
function is described by a linear function with a few abrupt changes that can be captured by
linear spline terms that induce large changes in slope near 17 and 19 years of schooling. Note
that an exhaustive search for a low-dimensional approximation in principle requires looking
at a very large set of models. Methods for HDS models, such as $\ell_1$-penalized least squares
(Lasso), which we employed in this example, are designed to avoid this search. □

Example 2: Series approximations and Condition ASM. It is clear from the
statement of Condition ASM that this expansion incorporates both substantial generalizations and
improvements over the conventional series approximation of regression functions in Newey
(1997). In order to explain this consider the set $\{P_j(z), j \ge 1\}$ of orthonormal basis functions
on $[0,1]^d$, e.g. orthopolynomials, with respect to the Lebesgue measure. Suppose $z_i$ have a
uniform distribution on $[0,1]^d$ for simplicity.$^3$ Assuming $E[f^2(z_i)] < \infty$, we can represent $f$
via a Fourier expansion, $f(z) = \sum_{j=1}^\infty \delta_j P_j(z)$, where $\{\delta_j, j \ge 1\}$ are Fourier coefficients that
satisfy $\sum_{j=1}^\infty \delta_j^2 < \infty$.

Let us consider the case that $f$ is a smooth function so that Fourier coefficients
feature a polynomial decay $\delta_j \propto j^{-\nu}$, where $\nu$ is a measure of smoothness of $f$. Consider

2 The set of functions considered consisted of 12 linear splines with various knots and monomials of degree
zero to four. Note that there were only 12 different levels of schooling.

3 The discussion in this example continues to apply when $z_i$ has a density that is bounded from above and
away from zero on $[0,1]^d$.

FIGURE 1. Traditional vs Lasso approximations. The figure illustrates the Post-Lasso sparse approximation and the
fourth order polynomial approximation of the wage function.

the conventional series expansion that uses the first $K$ terms for approximation, $f(z) =
\sum_{j=1}^K \beta_{0j} P_j(z) + a_c(z)$, with $\beta_{0j} = \delta_j$. Here $a_c(z_i)$ is the approximation error which obeys
$\sqrt{E_n[a_c^2(z_i)]} \lesssim_P \sqrt{E[a_c^2(z_i)]} \lesssim K^{-\frac{2\nu-1}{2}}$. Balancing the order $K^{-\frac{2\nu-1}{2}}$ of approximation error
with the order $\sqrt{K/n}$ of the estimation error gives the oracle-rate-optimal number of series
terms $s = K \propto n^{1/(2\nu)}$, and the resulting oracle series estimator, which knows $s$, will estimate $f$
at the oracle rate of $n^{-\frac{2\nu-1}{4\nu}}$. This also gives us the identity of the most important series terms
$T = \{1,\ldots,s\}$, which are simply the first $s$ terms. We conclude that Condition ASM holds for
the sparse approximation $f(z) = \sum_{j=1}^p \beta_{0j} P_j(z) + a(z)$, with $\beta_{0j} = \delta_j$ for $j \le s$ and $\beta_{0j} = 0$ for
$s + 1 \le j \le p$, and $a(z_i) = a_c(z_i)$, which coincides with the conventional series approximation
above, so that $\sqrt{E_n[a^2(z_i)]} \lesssim_P \sqrt{s/n}$ and $\|\beta_0\|_0 \le s$.

Next suppose that Fourier coefficients feature the following pattern: $\delta_j = 0$ for $j \le M$ and
$\delta_j \propto (j - M)^{-\nu}$ for $j > M$. Clearly in this case the standard series approximation based on
the first $K \le M$ terms, $\sum_{j=1}^K \delta_j P_j(z)$, has no predictive power for $f(z)$, and the corresponding
standard series estimator based on the first $K$ terms therefore fails completely.$^4$ In contrast,
Condition ASM is easily satisfied in this case, and the Lasso-based estimators will perform
at a near-oracle level in this case. Indeed, we can use the first $p$ series terms to form the
approximation $f(z) = \sum_{j=1}^p \beta_{0j} P_j(z) + a(z)$, where $\beta_{0j} = 0$ for $j \le M$ and $j > M + s$, $\beta_{0j} = \delta_j$
for $M + 1 \le j \le M + s$ with $s \propto n^{1/(2\nu)}$, and $p$ such that $M + n^{1/(2\nu)} = o(p)$. Hence $\|\beta_0\|_0 = s$,
and we have that $\sqrt{E_n[a^2(z_i)]} \lesssim_P \sqrt{E[a^2(z_i)]} \lesssim \sqrt{s/n} \lesssim n^{-\frac{2\nu-1}{4\nu}}$. □
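The rate calculation used in Example 2 can be spelled out in three steps. The sketch below assumes $\nu > 1/2$, so that the squared Fourier coefficients are summable, and uses orthonormality of the basis to write the population approximation error as a tail sum; the numerical case $\nu = 2$ is purely illustrative.

```latex
\begin{align*}
E[a_c^2(z_i)] = \sum_{j>K} \delta_j^2 \propto \sum_{j>K} j^{-2\nu} &\asymp K^{-(2\nu-1)}
  && \text{(squared approximation error)} \\
K^{-(2\nu-1)} \asymp K/n &\iff K \asymp n^{1/(2\nu)}
  && \text{(balance with squared estimation error)} \\
\sqrt{K/n} &\asymp n^{-\frac{2\nu-1}{4\nu}}
  && \text{(resulting oracle rate)}
\end{align*}
```

For instance, with $\nu = 2$ the balance gives $K \asymp n^{1/4}$ series terms and a rate of $n^{-3/8}$.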



4 This is not merely a finite sample phenomenon but is also accommodated in the asymptotics since we
expressly allow for array asymptotics; i.e. the underlying true model could change with $n$. Recall that we omit
the indexing by $n$ for ease of notation.

3. Sparse Estimation Methods

3.1. $\ell_1$-penalized and post $\ell_1$-penalized estimation methods. In order to discuss
estimation consider first, as a matter of motivation, the classical AIC/BIC type estimator
(Akaike 1974, Schwarz 1978) that solves the empirical (feasible) analog of the oracle
problem:

$$\min_{\beta \in \mathbb{R}^p} E_n[(y_i - x_i'\beta)^2] + \frac{\lambda}{n}\|\beta\|_0,$$

where $\lambda$ is a penalty level.$^5$ This estimator has attractive theoretical properties. Unfortunately,
it is computationally prohibitive since the solution to the problem may require solving $\sum_{k \le n} \binom{p}{k}$
least squares problems.$^6$
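To get a feel for the size of $\sum_{k \le n} \binom{p}{k}$, the following sketch simply evaluates the sum. The function name and the two parameter settings are illustrative assumptions, and the count here includes the empty model ($k = 0$).

```python
from math import comb

def n_subset_regressions(p: int, n: int) -> int:
    """Number of candidate supports of size at most n among p regressors."""
    return sum(comb(p, k) for k in range(min(n, p) + 1))

count_small = n_subset_regressions(p=10, n=3)     # 1 + 10 + 45 + 120
count_large = n_subset_regressions(p=200, n=20)
print(count_small, f"{count_large:.3e}")
```

Even with a modest $p = 200$ and supports capped at 20 regressors, the number of candidate models is astronomically large, which is what motivates the convex relaxation.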

One way to overcome the computational difficulty is to consider a convex relaxation of the
preceding problem, namely to employ a closest convex penalty - the $\ell_1$ penalty - in place of
the $\ell_0$ penalty. This construction leads to the so called Lasso estimator $\hat\beta$ (Tibshirani 1996),
defined as a solution for the following optimization problem:

$$\min_{\beta \in \mathbb{R}^p} E_n[(y_i - x_i'\beta)^2] + \frac{\lambda}{n}\|\beta\|_1, \qquad (3.7)$$

where $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$. The Lasso estimator is computationally attractive because it
minimizes a convex function. A basic choice for penalty level suggested by Bickel, Ritov, and
Tsybakov (2009) is

$$\lambda = 2 \cdot c\sigma\sqrt{2n\log(2p/\gamma)}, \qquad (3.8)$$

where $c > 1$ and $1 - \gamma$ is a confidence level that needs to be set close to 1. The formal motivation
for this penalty is that it leads to near-oracle rates of convergence of the estimator.
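As a minimal sketch of how (3.7) with the penalty level (3.8) might be computed, the code below uses plain cyclic coordinate descent. This solver is a swapped-in, textbook technique and not a method prescribed by the article; the names `lasso_cd` and `soft_threshold`, the simulated design, and the use of the true $\sigma$ (infeasible in practice) are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(a, t):
    """Soft-thresholding operator: shrink a toward zero by t."""
    return np.sign(a) * max(abs(a) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimize mean((y - X b)**2) + (lam / n) * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    resid = y.copy()                      # residual y - X b for the current b
    col_ss = (X ** 2).sum(axis=0)         # x_j' x_j for each column
    for _ in range(n_sweeps):
        for j in range(p):
            rho = X[:, j] @ resid + col_ss[j] * b[j]   # x_j' (partial residual)
            b_new = soft_threshold(rho, lam / 2.0) / col_ss[j]
            resid += X[:, j] * (b[j] - b_new)          # keep the residual in sync
            b[j] = b_new
    return b

rng = np.random.default_rng(0)
n, p, s, sigma = 100, 200, 5, 1.0
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))       # normalize so that E_n[x_ij^2] = 1
beta0 = np.zeros(p)
beta0[:s] = 2.0
y = X @ beta0 + sigma * rng.standard_normal(n)

c, gamma = 1.1, 0.05
lam = 2 * c * sigma * np.sqrt(2 * n * np.log(2 * p / gamma))   # penalty level (3.8)
beta_hat = lasso_cd(X, y, lam)
print("selected:", np.flatnonzero(beta_hat))
```

The threshold `lam / 2` in the coordinate update comes from the first-order condition of (3.7) in a single coordinate; the factor 2 reflects the fact that the criterion is not halved.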

The penalty level specified above is not feasible since it depends on the unknown $\sigma$. Belloni
and Chernozhukov (2011c) propose to set

$$\lambda = 2 \cdot c\hat\sigma\sqrt{n}\,\Phi^{-1}(1 - \gamma/2p), \qquad (3.9)$$

with $\hat\sigma = \sigma + o_P(1)$ obtained via an iteration method defined in Appendix A, where $c > 1$ and
$1 - \gamma$ is a confidence level.$^7$ Belloni and Chernozhukov (2011c) also propose the $X$-dependent
penalty level:

$$\lambda = c \cdot 2\hat\sigma\Lambda(1 - \gamma|X), \qquad (3.10)$$

where

$$\Lambda(1 - \gamma|X) = (1 - \gamma)\text{-quantile of } n\|E_n[x_i g_i]\|_\infty \mid X,$$

with $X = [x_1,\ldots,x_n]'$ and $g_i$ i.i.d. $N(0,1)$, which can be easily approximated by
simulation. We note that

$$\Lambda(1 - \gamma|X) \le \sqrt{n}\,\Phi^{-1}(1 - \gamma/2p) \le \sqrt{2n\log(2p/\gamma)}, \qquad (3.11)$$

so $\sqrt{2n\log(2p/\gamma)}$ provides a simple upper bound on the penalty level. Note also that Belloni,
Chen, Chernozhukov, and Hansen (2010) formulate a feasible Lasso procedure for the case with
heteroscedastic, non-Gaussian disturbances. We shall refer to the feasible Lasso method with
the feasible penalty levels (3.9) or (3.10) as the Iterated Lasso. This estimator has statistical
performance that is similar to that of the (infeasible) Lasso described above.

5 The penalty level $\lambda$ in the AIC/BIC type estimator needs to account for the noise since it observes $y_i$ instead
of $f(z_i)$ unlike the oracle problem (2.3).

6 Results on the computational intractability of this problem were established in Natarajan (1995), Ge, Jiang,
and Ye (2011) and Chen, Ge, Wang, and Ye (2011).

7 Practical recommendations include the choice $c = 1.1$ and $\gamma = .05$.
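The simulation of $\Lambda(1 - \gamma|X)$ mentioned above can be sketched as follows. The function name, the number of draws, and the simulated design are assumptions for illustration; the two closed-form quantities are the bounds appearing in (3.11).

```python
import numpy as np
from statistics import NormalDist

def simulated_penalty_quantile(X, gamma, n_draws=2000, seed=0):
    """(1 - gamma)-quantile of n * ||E_n[x_i g_i]||_inf given X, with g_i ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    G = rng.standard_normal((n_draws, n))
    stat = np.abs(G @ X).max(axis=1)       # n * ||E_n[x_i g_i]||_inf = ||X'g||_inf
    return np.quantile(stat, 1 - gamma)

rng = np.random.default_rng(1)
n, p, gamma = 100, 500, 0.05
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))        # normalize so that E_n[x_ij^2] = 1

Lam = simulated_penalty_quantile(X, gamma)
asym = np.sqrt(n) * NormalDist().inv_cdf(1 - gamma / (2 * p))   # middle term of (3.11)
crude = np.sqrt(2 * n * np.log(2 * p / gamma))                  # right term of (3.11)
print(Lam, asym, crude)
```

With nearly uncorrelated columns, as in this simulated design, the simulated quantile sits close to the asymptotic value; the data-dependent choice is expected to help most when the regressors are strongly correlated.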

Belloni, Chernozhukov, and Wang (2011) propose a variant called the Square-root Lasso
estimator $\hat\beta$ defined as a solution to the following program:

$$\min_{\beta \in \mathbb{R}^p} \sqrt{E_n[(y_i - x_i'\beta)^2]} + \frac{\lambda}{n}\|\beta\|_1, \qquad (3.12)$$

with the penalty level

$$\lambda = c \cdot \Lambda(1 - \gamma|X), \qquad (3.13)$$

where $c > 1$ and

$$\Lambda(1 - \gamma|X) = (1 - \gamma)\text{-quantile of } n\|E_n[x_i g_i]\|_\infty / \sqrt{E_n[g_i^2]} \mid X,$$

with $g_i \sim N(0,1)$ independent for $i = 1,\ldots,n$. As with Lasso, there is also a simple asymptotic
option for setting the penalty level:

$$\lambda = c \cdot \sqrt{n}\,\Phi^{-1}(1 - \gamma/2p). \qquad (3.14)$$

The main attractive feature of (3.12) is that the penalty level $\lambda$ is independent of the value $\sigma$,
and so it is pivotal with respect to that parameter. Nonetheless, this estimator has statistical
performance that is similar to that of the (infeasible) Lasso described above. Moreover, the
estimator is a solution to a highly tractable conic programming problem:

$$\min_{t \ge 0, \ \beta \in \mathbb{R}^p} t + \frac{\lambda}{n}\|\beta\|_1 \ : \ \sqrt{E_n[(y_i - x_i'\beta)^2]} \le t, \qquad (3.15)$$

where the criterion function is linear in parameters $t$ and positive and negative components of
$\beta$, while the constraint can be formulated with a second-order cone, informally known also as
the "ice-cream cone".

There are several other estimators that make use of penalization by the $\ell_1$-norm. An
important case includes the Dantzig selector estimator proposed and analyzed by Candes and Tao
(2007). It also relies on $\ell_1$-regularization but exploits the notion that the residuals should be
nearly uncorrelated with the covariates. The estimator is defined as a solution to:

$$\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \ : \ \|E_n[x_i(y_i - x_i'\beta)]\|_\infty \le \lambda/n, \qquad (3.16)$$

where $\lambda = \sigma\Lambda(1 - \gamma|X)$. In what follows we will focus our discussion on Lasso but virtually
all theoretical results carry over to other $\ell_1$-regularized estimators including (3.12) and (3.16).
We also refer to Gautier and Tsybakov (2011) for a feasible Dantzig estimator that combines
the square-root lasso method (3.15) with the Dantzig method.

$\ell_1$-regularized estimators often have a substantial shrinkage bias. In order to remove some
of this bias, we consider the post-model-selection estimator that applies ordinary least squares
regression to the model $\hat T$ selected by an $\ell_1$-regularized estimator $\hat\beta$. Formally, set

$$\hat T = \mathrm{support}(\hat\beta) = \{j \in \{1,\ldots,p\} : |\hat\beta_j| > 0\},$$

and define the post model selection estimator $\tilde\beta$ as

$$\tilde\beta \in \arg\min_{\beta \in \mathbb{R}^p} E_n[(y_i - x_i'\beta)^2] \ : \ \beta_j = 0 \text{ for each } j \in \hat T^c, \qquad (3.17)$$

where $\hat T^c = \{1,\ldots,p\} \setminus \hat T$. In words, the estimator is ordinary least squares applied to the data
after removing the regressors that were not selected in $\hat T$. When the $\ell_1$-regularized method used
to select the model is Lasso (Square-root Lasso), the post-model-selection estimator is called
Post-Lasso (Post-Square-root Lasso). If model selection works perfectly - that is, $\hat T = T$ - then
the post-model-selection estimator is simply the oracle estimator whose properties are
well-known. However, perfect model selection is unlikely in many situations, so we are interested
in the properties of the post-model-selection estimator when model selection is imperfect, i.e.
when $\hat T \ne T$, and are especially interested in cases where $T \not\subseteq \hat T$. In Section 4 we describe the
formal properties of the Post-Lasso estimator.
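The refitting step in (3.17) amounts to ordinary least squares on the selected columns. The sketch below assumes a first-stage estimate is already available; the function name `post_selection_ols` and the hand-made first-stage vector (shrunken coefficients plus one wrongly selected regressor) are hypothetical and chosen only to make the check transparent.

```python
import numpy as np

def post_selection_ols(X, y, beta_first, tol=0.0):
    """OLS restricted to the support of a first-stage estimate; zeros elsewhere."""
    support = np.flatnonzero(np.abs(beta_first) > tol)
    beta_post = np.zeros(X.shape[1])
    if support.size:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta_post[support] = coef
    return beta_post, support

rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.standard_normal((n, p))
y = 1.5 * X[:, 0] - 2.0 * X[:, 3]          # noise-free outcome for a transparent check

# Hypothetical first-stage output: shrunken coefficients and one false selection.
beta_first = np.zeros(p)
beta_first[[0, 3, 7]] = [1.1, -1.6, 0.05]

beta_post, T_hat = post_selection_ols(X, y, beta_first)
print(T_hat, beta_post[T_hat])
```

In this noise-free illustration the refit undoes the shrinkage on the two relevant regressors and assigns a coefficient of essentially zero to the wrongly selected one.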

3.2. Some Heuristics via Convex Geometry. Before proceeding to the formal results on
estimation, it is useful to consider some heuristics for the $\ell_1$-penalized estimators and the
choice of the penalty level. For this purpose we consider a parametric model, and a generic
$\ell_1$-regularized estimator based on a differentiable criterion function $Q$:

$$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} Q(\beta) + \frac{\lambda}{n}\|\beta\|_1, \qquad (3.18)$$

where, e.g., $Q(\beta) = E_n[(y_i - x_i'\beta)^2]$ for Lasso and $Q(\beta) = \sqrt{E_n[(y_i - x_i'\beta)^2]}$ for Square-root
Lasso. The key quantity in the analysis of (3.18) is the score - the gradient of $Q$ at the true
value:$^8$

$$S = \nabla Q(\beta_0).$$

The score $S$ is the effective "noise" in the problem that should be dominated by the
regularization. However we would like to make the regularization bias as small as possible. This
reasoning suggests choosing the smallest penalty level $\lambda$ that is large enough to dominate the
noise with high probability, say $1 - \gamma$, which yields

$$\lambda \ge c\Lambda, \quad \text{for } \Lambda := n\|S\|_\infty, \qquad (3.19)$$

where $\Lambda$ is the maximal score scaled by $n$, and $c > 1$ is a theoretical constant of Bickel, Ritov,
and Tsybakov (2009) that guarantees that the score is dominated. We note that the principle
of setting $\lambda$ to dominate the score of the criterion function is a general principle that carries
over to other convex problems with possibly non-differentiable criterion functions and that
leads to the optimal - near-oracle - performance of $\ell_1$-penalized estimators. See, for instance,
Belloni and Chernozhukov (2011a).

8 In the case of a nonparametric model the score is similar to the gradient of $Q$ at $\beta_0$ but ignores the
approximation errors $r_i$'s.

It is useful to mention some simple heuristics for the principle (3.19) which arise from
considering the simplest case where none of the regressors are significant so that $\beta_0 = 0$. We
want our estimator to perform at a near-oracle level in all cases, including this case, but here
the oracle estimator $\beta^*$ sets $\beta^* = \beta_0 = 0$. We also want $\hat\beta = \beta_0 = 0$ in this case, at least with
a high probability, say $1 - \gamma$. From the subgradient optimality conditions for (3.18), we must
have

$$-S_j + \lambda/n \ge 0 \ \text{ and } \ S_j + \lambda/n \ge 0 \ \text{ for all } 1 \le j \le p$$

for this to be true. We can only guarantee this by setting the penalty level $\lambda/n$ such that
$\lambda \ge n\max_{1 \le j \le p} |S_j| = n\|S\|_\infty$ with probability at least $1 - \gamma$. This is precisely the rule (3.19)
appearing above.
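The heuristic can be checked numerically in the pure-noise case $\beta_0 = 0$ for the Lasso criterion. The sketch below is an illustration under assumed simulated data: with $\lambda = n\|S\|_\infty$ no trial point improves on $\beta = 0$, whereas with half that penalty a small step along the coordinate with the largest score does.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 30
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)                 # beta_0 = 0: the outcome is pure noise

def objective(b, lam):
    """Lasso criterion E_n[(y - x'b)^2] + (lam / n) * ||b||_1."""
    return np.mean((y - X @ b) ** 2) + lam / n * np.abs(b).sum()

S = -2 * X.T @ y / n                       # gradient of Q at beta = 0
lam_min = n * np.abs(S).max()              # smallest lam with lam / n >= ||S||_inf

# At lam_min, no random trial point does better than the zero vector.
zero = np.zeros(p)
trial = 0.1 * rng.standard_normal((1000, p))
gap_ok = min(objective(b, lam_min) - objective(zero, lam_min) for b in trial)

# Below the threshold, a small step along the largest-score coordinate helps.
j = np.abs(S).argmax()
step = np.zeros(p)
step[j] = -1e-3 * np.sign(S[j])
gap_small = objective(step, 0.5 * lam_min) - objective(zero, 0.5 * lam_min)
print(gap_ok, gap_small)
```

The first gap is non-negative by convexity of $Q$, which is the same argument as the subgradient condition in the text; the second gap turning negative shows why a penalty below $n\|S\|_\infty$ cannot keep the estimate at zero.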

Finally, note that in the case of Lasso and Square-root Lasso we have the following
expressions for the score:

$$\text{Lasso}: \quad S = 2E_n[x_i\epsilon_i] =_d 2\sigma E_n[x_i g_i],$$

$$\text{Square-root Lasso}: \quad S = \frac{E_n[x_i\epsilon_i]}{\sqrt{E_n[\epsilon_i^2]}} =_d \frac{E_n[x_i g_i]}{\sqrt{E_n[g_i^2]}},$$

where $g_i$ are i.i.d. $N(0,1)$ variables. Note that the score for Square-root Lasso is pivotal,
while the score for Lasso is not, as it depends on $\sigma$. Thus, the choice of the penalty level for
Square-root Lasso need not depend on $\sigma$ to produce near-oracle performance for this estimator.
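The pivotality claim is easy to see numerically: rescaling the disturbances rescales the Lasso score but leaves the Square-root Lasso score unchanged. The sketch below uses an assumed simulated design and a hypothetical helper `score_norms`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
g = rng.standard_normal(n)                 # standard normal draws, eps = sigma * g

def score_norms(sigma):
    """Sup-norms of the Lasso and Square-root Lasso scores for eps = sigma * g."""
    eps = sigma * g
    lasso = np.abs(2 * X.T @ eps / n).max()
    sqrt_lasso = np.abs(X.T @ eps / n).max() / np.sqrt(np.mean(eps ** 2))
    return lasso, sqrt_lasso

lasso_1, sqrt_1 = score_norms(1.0)
lasso_5, sqrt_5 = score_norms(5.0)
print(lasso_5 / lasso_1, sqrt_5 / sqrt_1)
```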

3.3. Beyond Mean Models. Most of the literature on high dimensional sparse models
focuses on the mean regression model discussed above. Here we discuss methods that have been
proposed to deal with quantile regression and generalized linear models in high-dimensional
sparse settings. We assume i.i.d. sampling for $(y_i, x_i)$ in this subsection.

3.3.1. Quantile Regression. We consider a response variable $y_i$ and $p$-dimensional covariates
$x_i$ such that the $u$-th conditional quantile function of $y_i$ given $x_i$ is given by

$$F^{-1}_{y_i|x_i}(u|x) = x'\beta(u), \quad \beta(u) \in \mathbb{R}^p, \qquad (3.20)$$

where $u \in (0,1)$ is the quantile index of interest. Recall that the $u$-th conditional quantile
$F^{-1}_{y_i|x_i}(u|x)$ is the inverse of the conditional distribution function $F_{y_i|x_i}(y|x)$ of $y_i$ given $x_i = x$.
Suppose that the true model $\beta(u)$ has a sparse support:

$$T_u = \mathrm{support}(\beta(u)) = \{j \in \{1,\ldots,p\} : |\beta_j(u)| > 0\}$$

has only $s_u \le s \le n/\log(n \vee p)$ non-zero components.






BELLONI CHERNOZHUKOV HANSEN 



The population coefficient $\beta(u)$ is known to be a minimizer of the criterion function

$$Q_u(\beta) = E[\rho_u(y_i - x_i'\beta)], \qquad (3.21)$$

where $\rho_u(t) = (u - 1\{t \le 0\})t$ is the asymmetric absolute deviation function; see Koenker
and Bassett (1978). Given a random sample $(y_1, x_1),\ldots,(y_n, x_n)$, $\hat\beta(u)$, the quantile regression
estimator of $\beta(u)$, is defined as a minimizer of the empirical analog of (3.21):

$$\hat Q_u(\beta) = E_n[\rho_u(y_i - x_i'\beta)]. \qquad (3.22)$$

As before, in high-dimensional settings, ordinary quantile regression is generally not consistent,
which motivates the use of penalization in order to remove all, or at least nearly all, regressors
whose population coefficients are zero. The $\ell_1$-penalized quantile regression estimator $\hat\beta(u)$ is
a solution to the following optimization problem:

$$\min_{\beta \in \mathbb{R}^p} \hat Q_u(\beta) + \frac{\lambda\sqrt{u(1-u)}}{n}\|\beta\|_1. \qquad (3.23)$$

The criterion function in (3.23) is the sum of the criterion function (3.22) and a penalty function
given by a scaled $\ell_1$-norm of the parameter vector.
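To illustrate the role of $\rho_u$, the sketch below evaluates the check function and verifies, in the intercept-only case, that minimizing the empirical criterion (3.22) picks out a sample quantile. The helper name `check_loss` and the toy data are assumptions; the search is restricted to the data points because the criterion is piecewise linear with kinks only there.

```python
import numpy as np

def check_loss(t, u):
    """Asymmetric absolute deviation rho_u(t) = (u - 1{t <= 0}) * t."""
    t = np.asarray(t, dtype=float)
    return (u - (t <= 0)) * t

y = np.arange(101.0)                        # toy outcomes 0, 1, ..., 100
u = 0.25
candidates = y                              # kinks of the piecewise-linear criterion
avg_loss = np.array([check_loss(y - q, u).mean() for q in candidates])
q_star = candidates[avg_loss.argmin()]
print(q_star)
```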

In order to describe choice of the penalty level $\lambda$, we introduce the random variable

$$\Lambda = n \max_{1 \le j \le p} \left| E_n\left[ \frac{x_{ij}(u - 1\{u_i \le u\})}{\sqrt{u(1-u)}} \right] \right|, \qquad (3.24)$$

where $u_1,\ldots,u_n$ are i.i.d. uniform (0,1) random variables, independently distributed from
the regressors, $x_1,\ldots,x_n$. The random variable $\Lambda$ has a pivotal distribution conditional on
$X = [x_1,\ldots,x_n]'$. Then, for $c > 1$, Belloni and Chernozhukov (2011a) propose to set

$$\lambda = c \cdot \Lambda(1 - \gamma|X), \ \text{ where } \Lambda(1 - \gamma|X) := (1 - \gamma)\text{-quantile of } \Lambda \text{ conditional on } X, \qquad (3.25)$$

and $1 - \gamma$ is a confidence level that needs to be set close to 1.
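Because $\Lambda$ in (3.24) is pivotal given $X$, its quantile in (3.25) can be approximated by simulation. The sketch below is one way to do so; the function name `qr_penalty`, the default values of `gamma` and `c`, the number of draws, and the simulated design are illustrative assumptions.

```python
import numpy as np

def qr_penalty(X, u, gamma=0.05, c=1.1, n_draws=2000, seed=0):
    """Simulate c times the (1 - gamma)-quantile of Lambda in (3.24), given X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.uniform(size=(n_draws, n))                 # i.i.d. uniform(0, 1) draws
    W = (u - (U <= u)) / np.sqrt(u * (1 - u))          # mean-zero, unit-variance weights
    stat = np.abs(W @ X).max(axis=1)                   # n * max_j |E_n[x_ij w_i]|
    return c * np.quantile(stat, 1 - gamma)

rng = np.random.default_rng(1)
n, p = 100, 200
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))                    # normalize so that E_n[x_ij^2] = 1

lam_median = qr_penalty(X, u=0.5)
lam_tail = qr_penalty(X, u=0.9)
print(lam_median, lam_tail)
```

Note that the outcome $y_i$ never enters the calculation, which is what pivotality means here: the penalty level depends only on the design, the quantile index, and the chosen $c$ and $\gamma$.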

The post-penalized estimator (post-$\ell_1$-QR) applies ordinary quantile regression to the model $\hat T_u$ selected by the $\ell_1$-penalized quantile regression (Belloni and Chernozhukov 2011a). Specifically, set

$$\hat T_u = \mathrm{support}(\hat\beta(u)) = \{j \in \{1,\ldots,p\} : |\hat\beta_j(u)| > 0\},$$

and define the post-penalized estimator $\tilde\beta(u)$ as

$$\tilde\beta(u) \in \arg\min_{\beta \in \mathbb{R}^p :\ \beta_j = 0,\ j \in \hat T_u^c} \hat Q_u(\beta), \qquad (3.26)$$

which is just ordinary quantile regression removing the regressors that were not selected in the first step. Belloni and Chernozhukov (2011a) derive the basic properties of the estimators above; see also Kato (2011) for further important results in the nonparametric setting, where group penalization is also studied.
3.3.2. Generalized Linear Models. From the discussion above, it is clear that $\ell_1$-regularized methods can be extended to other criterion functions $Q$ beyond least squares and quantile regression. $\ell_1$-regularized generalized linear models were considered in van de Geer (2008). Let $y \in \mathbb{R}$ denote the response variable and $x \in \mathbb{R}^p$ the covariates. The criterion function of interest is defined as

$$\hat Q(\beta) = \frac{1}{n}\sum_{i=1}^n h(y_i, x_i'\beta),$$

where $h$ is convex and 1-Lipschitz with respect to the second argument, $|h(y,t) - h(y,t')| \leq |t - t'|$. We assume $h$ is differentiable in the second argument with derivative denoted $\nabla h$ to simplify exposition. Let the true model parameter be defined by $\beta_0 \in \arg\min_{\beta \in \mathbb{R}^p} \mathrm{E}[h(y_i, x_i'\beta)]$, and consequently we have $\mathrm{E}[x_i \nabla h(y_i, x_i'\beta_0)] = 0$. The $\ell_1$-regularized estimator is given by the solution of

$$\min_{\beta \in \mathbb{R}^p} \hat Q(\beta) + \frac{\lambda}{n}\|\beta\|_1.$$



Under high level conditions, van de Geer (2008) derived bounds on the excess forecasting loss, $\mathrm{E}[h(y_i, x_i'\hat\beta)] - \mathrm{E}[h(y_i, x_i'\beta_0)]$, under sparsity-related assumptions, and also specialized the results to logistic regression, density estimation, and other problems. The choice of penalty parameter $\lambda$ derived in van de Geer (2008) relies on using the contraction inequalities of Ledoux and Talagrand (1991) in order to bound the score:

$$n\|\nabla \hat Q(\beta_0)\|_\infty = \left\|\sum_{i=1}^n x_i \nabla h(y_i, x_i'\beta_0)\right\|_\infty \lesssim_P \left\|\sum_{i=1}^n x_i \xi_i\right\|_\infty, \qquad (3.27)$$

where $\xi_i$ are independent Rademacher random variables, $P(\xi_i = 1) = P(\xi_i = -1) = 1/2$. Then van de Geer (2008) suggests further bounds on the right side of (3.27). For efficiency reasons, we suggest simulating the $1 - \gamma$ quantiles of the right side of (3.27) conditional on regressors. Either way, one can achieve the domination of "noise", $\lambda/n \geq c\|\nabla \hat Q(\beta_0)\|_\infty$, with high probability. Note that since $h$ is 1-Lipschitz, this choice of the penalty level is pivotal.



4. Estimation Results for High Dimensional Sparse Models 

4.1. Convergence Rates for Lasso and Post-Lasso. Having introduced Condition ASM and the target parameter defined via (2.3), our task becomes to estimate $\beta_0$. We will focus on convergence results in the prediction norm for $\delta = \hat\beta - \beta_0$, which measures the accuracy of predicting $x_i'\beta_0$ over the design points $x_1, \ldots, x_n$,

$$\|\delta\|_{2,n} := \sqrt{\mathrm{E}_n[(x_i'\delta)^2]} = \sqrt{\delta'\mathrm{E}_n[x_i x_i']\delta}.$$

The prediction norm directly depends on the Gram matrix $\mathrm{E}_n[x_i x_i']$. Whenever $p > n$, the empirical Gram matrix $\mathrm{E}_n[x_i x_i']$ does not have full rank and in principle is not well-behaved.



Footnote: Results in other norms of interest could also be derived, and the behavior of the post-$\ell_1$-regularized estimators would also be interesting to consider. This is an interesting avenue for future work.
However, we only need good behavior of certain moduli of continuity of the Gram matrix called sparse eigenvalues. We define the minimal $m$-sparse eigenvalue of a semi-definite matrix $M$ as

$$\phi_{\min}(m)[M] := \min_{\|\delta\|_0 \leq m,\ \delta \neq 0} \frac{\delta' M \delta}{\|\delta\|_2^2},$$

and the maximal $m$-sparse eigenvalue as

$$\phi_{\max}(m)[M] := \max_{\|\delta\|_0 \leq m,\ \delta \neq 0} \frac{\delta' M \delta}{\|\delta\|_2^2}. \qquad (4.29)$$

To assume that $\phi_{\min}(m)[\mathrm{E}_n[x_i x_i']] > 0$ requires that all empirical Gram submatrices formed by any $m$ components of $x_i$ are positive definite. To simplify asymptotic statements for Lasso and Post-Lasso, we use the following condition:

Condition SE. There is $\ell_n \to \infty$ such that

$$\kappa' \leq \phi_{\min}(\ell_n s)[\mathrm{E}_n[x_i x_i']] \leq \phi_{\max}(\ell_n s)[\mathrm{E}_n[x_i x_i']] \leq \kappa'',$$

where $0 < \kappa' < \kappa'' < \infty$ are constants that do not depend on $n$.

Comment 4.1. It is well-known that Condition SE is quite plausible for many designs of interest. For instance, Condition SE holds with probability approaching one as $n \to \infty$ if $x_i$ is a normalized form of $\tilde x_i$, namely $x_{ij} = \tilde x_{ij}/\sqrt{\mathrm{E}_n[\tilde x_{ij}^2]}$, and one of the following holds:

• $\tilde x_i$, $i = 1, \ldots, n$, are i.i.d. zero-mean Gaussian random vectors that have population Gram matrix $\mathrm{E}[\tilde x_i \tilde x_i']$ with ones on the diagonal and its minimal and maximal $s\log n$-sparse eigenvalues bounded away from zero and from above, where $s\log n = o(n/\log p)$;

• $\tilde x_i$, $i = 1, \ldots, n$, are i.i.d. bounded zero-mean random vectors with $\|\tilde x_i\|_\infty \leq K_n$ a.s. that have population Gram matrix $\mathrm{E}[\tilde x_i \tilde x_i']$ with ones on the diagonal and its minimal and maximal $s\log n$-sparse eigenvalues bounded from above and away from zero, where $K_n^2 s\log^5(p \vee n) = o(n)$.

Recall that a standard assumption in econometric research is to assume that the population Gram matrix $\mathrm{E}[\tilde x_i \tilde x_i']$ has eigenvalues bounded from above and below, see e.g. Newey (1997). The conditions above allow for this and more general behavior, requiring only that the $s\log n$-sparse eigenvalues of the population Gram matrix $\mathrm{E}[\tilde x_i \tilde x_i']$ are bounded from below and from above. The latter is important for allowing functions $x_i$ to be formed as a combination of elements from different bases, e.g. a combination of B-splines with polynomials. □

The following theorem describes the rate of convergence for feasible Lasso in the Gaussian model under Conditions ASM and SE. We formally define the feasible Lasso estimator $\hat\beta$ as either the Iterated Lasso with penalty level given by the X-independent rule (3.9) or the X-dependent rule (3.10), or Square-root Lasso with penalty level given by the X-dependent rule (3.13) or the X-independent rule (3.14), with the confidence level $1 - \gamma$ such that

$$\gamma = o(1) \quad \text{and} \quad \log(1/\gamma) \lesssim \log(p \vee n). \qquad (4.30)$$
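Before turning to the theorem, it may help to make the sparse eigenvalues that enter Condition SE concrete. The sketch below computes them by brute force, enumerating every support of size $m$ and taking the extreme eigenvalues of the corresponding principal submatrices (supports smaller than $m$ never give more extreme values, by eigenvalue interlacing). This is feasible only for small $p$ and $m$; the function name `sparse_eigenvalues` is a hypothetical label used for illustration, and nothing here is a procedure proposed in the text.

```python
import numpy as np
from itertools import combinations

def sparse_eigenvalues(M, m):
    """Brute-force minimal and maximal m-sparse eigenvalues of a symmetric
    positive semi-definite matrix M (requires 1 <= m <= M.shape[0])."""
    p = M.shape[0]
    phi_min, phi_max = np.inf, -np.inf
    for S in combinations(range(p), m):
        # Eigenvalues of the principal submatrix indexed by S, ascending.
        eig = np.linalg.eigvalsh(M[np.ix_(S, S)])
        phi_min = min(phi_min, eig[0])
        phi_max = max(phi_max, eig[-1])
    return phi_min, phi_max
```

For an empirical Gram matrix with $p > n$, the full minimal eigenvalue is zero, while the minimal $m$-sparse eigenvalue for small $m$ is typically bounded away from zero, which is the point of Condition SE.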
Theorem 1 (Rates for Feasible Lasso). Suppose that conditions ASM and SE hold. Then for $n$ large enough the following bounds hold with probability at least $1 - \gamma$:

where C > and C > are constants, C > and C < l/Vrf, and \og(p/j) < log(p V n). 

Comment 4.2. Thus the rate for estimating $\beta_0$ is $\sqrt{s/n}$, i.e. the root of the number of parameters $s$ in the "true" model divided by the sample size $n$, times a logarithmic factor $\sqrt{\log(p \vee n)}$. The latter factor can be thought of as the price of not knowing the "true" model. Note that the rate for estimating the regression function $f$ over design points follows from the triangle inequality and Condition ASM:

$$\sqrt{\mathrm{E}_n[(f(z_i) - x_i'\hat\beta)^2]} \leq \|\hat\beta - \beta_0\|_{2,n} + c_s \lesssim_P \sigma\sqrt{\frac{s\log(p \vee n)}{n}}. \qquad (4.31)$$

Comment 4.3. The result of Theorem 1 is an extension of the results in the fundamental work of Bickel, Ritov, and Tsybakov (2009) and Meinshausen and Yu (2009) on infeasible Lasso and Candes and Tao (2007) on the Dantzig estimator. The result of Theorem 1 is derived in Belloni and Chernozhukov (2011c) for Iterated Lasso, and in Belloni, Chernozhukov, and Wang (2011) and Belloni, Chernozhukov, and Wang (2010) for Square-root Lasso (with constants $C$ given explicitly). Similar results also hold for $\ell_1$-QR (Belloni and Chernozhukov 2011a) and other M-estimation problems (van de Geer 2008). The bounds of Theorem 1 allow the construction of confidence sets for $\beta_0$, as noted in Chernozhukov (2009); see also Gautier and Tsybakov (2011). Such confidence sets rely on efficiently bounding $C$. Computing bounds for $C$ requires computation of combinatorial quantities depending on the unknown model $T$, which makes the approach difficult in practice. In the subsequent sections, we will present completely different approaches to inference which have provable confidence properties for parameters of interest and which are computationally tractable. □

As mentioned before, $\ell_1$-regularized estimators have an inherent bias towards zero and Post-Lasso was proposed to remove this bias, at least in part. It turns out that we can bound the performance of Post-Lasso as a function of Lasso's rate of convergence and Lasso's model selection ability. For common designs, this bound implies that Post-Lasso performs at least as well as Lasso, and it can be strictly better in some cases. Post-Lasso also has a smaller shrinkage bias than Lasso by construction.
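The refitting step that defines Post-Lasso is simple enough to write down directly. The following is a minimal sketch, assuming a first-step Lasso coefficient vector is already available from some solver; the function name `post_lasso` and the threshold argument `tol` are hypothetical and used only for illustration.

```python
import numpy as np

def post_lasso(X, y, beta_lasso, tol=0.0):
    """Refit by ordinary least squares on the support selected by Lasso.

    beta_lasso is the first-step Lasso estimate; entries with absolute
    value above tol define the selected model.
    """
    support = np.flatnonzero(np.abs(beta_lasso) > tol)
    beta = np.zeros(X.shape[1])
    if support.size:
        # Unpenalized least squares using only the selected regressors.
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta[support] = coef
    return beta, support
```

Because the second step is unpenalized, the shrinkage applied to the selected coefficients in the first step does not carry over to the refitted estimate.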

The following theorem applies to any Post-Lasso estimator $\tilde\beta$ computed using the model $\hat T = \mathrm{support}(\hat\beta)$ selected by a Feasible Lasso estimator $\hat\beta$ defined before Theorem 1.

Theorem 2 (Rates for Feasible Post-Lasso). Suppose the conditions of Theorem 1 hold and let $\varepsilon > 0$. Then there are constants $C'$ and $C_\varepsilon$ such that with probability $1 - \gamma$

$$\hat s = |\hat T| \leq C's,$$

and with probability $1 - \gamma$

$$\|\tilde\beta - \beta_0\|_{2,n} \leq C_\varepsilon\,\sigma\sqrt{\frac{s\log(p \vee n)}{n}}. \qquad (4.32)$$

If further $|\hat T \setminus T| = o(s)$ and $T \subseteq \hat T$ with probability approaching one, then

$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\left[\sqrt{\frac{o(s)\log(p \vee n)}{n}} + \sqrt{\frac{s}{n}}\right]. \qquad (4.33)$$

If $\hat T = T$ with probability approaching one, then Post-Lasso achieves the oracle performance

$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{s/n}. \qquad (4.34)$$

Comment 4.4. The theorem above shows that Feasible Post-Lasso achieves the same near-oracle rate as Feasible Lasso. Notably, this occurs despite the fact that Feasible Lasso may in general fail to correctly select the oracle model $T$ as a subset, that is $T \not\subseteq \hat T$. The intuition for this result is that any components of $T$ that Feasible Lasso misses are very unlikely to be important. Theorem 2 was derived in Belloni and Chernozhukov (2011c) and Belloni, Chernozhukov, and Wang (2010). Similar results have been shown before for $\ell_1$-QR (Belloni and Chernozhukov 2011a), and can be derived for other methods that yield sparse estimators. □



4.2. Monte Carlo Example. In this section we compare the performance of various estimators relative to the ideal oracle linear regression estimator. The oracle estimator applies ordinary least squares to the true model by regressing the outcome on only the control variables with non-zero coefficients. Of course, the oracle estimator is not available outside Monte Carlo experiments.

We considered the following regression model:

$$y = x'\beta_0 + \varepsilon, \qquad \beta_0 = (1, 1, 1/2, 1/3, 1/4, 1/5, 0, \ldots, 0)',$$

where $x = (1, z')'$ consists of an intercept and covariates $z \sim N(0, \Sigma)$, and the errors $\varepsilon$ are independently and identically distributed $\varepsilon \sim N(0, \sigma^2)$. The dimension $p$ of the covariates $x$ is 500, and the dimension $s$ of the true model is 6. The sample size $n$ is 100. The regressors are correlated with $\Sigma_{ij} = \rho^{|i-j|}$ and $\rho = 0.5$. We consider the levels of noise to be $\sigma = 1$ and $\sigma = 0.1$. For each repetition we draw new $x$'s and $\varepsilon$'s.
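One repetition of this design can be generated as in the sketch below. It reads the covariance as $\Sigma_{ij} = \rho^{|i-j|}$ over the 499 non-intercept covariates (so that $x$ has dimension 500 including the intercept); the function names `simulate_design` and `oracle_ols` are hypothetical, and the code is an illustration of the design rather than the authors' simulation code.

```python
import numpy as np

def simulate_design(n=100, p=500, rho=0.5, sigma=1.0, seed=0):
    """Draw (X, y, beta0) from the sparse Gaussian regression design."""
    rng = np.random.default_rng(seed)
    beta0 = np.zeros(p)
    beta0[:6] = [1, 1, 1 / 2, 1 / 3, 1 / 4, 1 / 5]
    k = p - 1                                   # covariates besides intercept
    idx = np.arange(k)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    L = np.linalg.cholesky(Sigma)
    Z = rng.standard_normal((n, k)) @ L.T       # rows are N(0, Sigma) draws
    X = np.column_stack([np.ones(n), Z])
    y = X @ beta0 + sigma * rng.standard_normal(n)
    return X, y, beta0

def oracle_ols(X, y, beta0):
    """Least squares on the true support; infeasible outside simulations."""
    T = np.flatnonzero(beta0)
    coef, *_ = np.linalg.lstsq(X[:, T], y, rcond=None)
    beta = np.zeros_like(beta0)
    beta[T] = coef
    return beta
```

Repeating the draw with different seeds and averaging $\tilde\beta - \beta_0$ and the squared prediction error over repetitions gives Monte Carlo analogs of the bias and risk measures discussed for this design.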

We consider infeasible Lasso and Post-Lasso estimators, feasible Lasso and Post-Lasso estimators described in the previous section, all with X-dependent penalty levels, as well as (5-fold) cross-validated (CV) Lasso and Post-Lasso. We summarize results on estimation performance in Table 2, which records for each estimator $\tilde\beta$ the norm of the bias $\|\mathrm{E}[\tilde\beta - \beta_0]\|$ and also the empirical risk $\{\mathrm{E}[(x_i'(\tilde\beta - \beta_0))^2]\}^{1/2}$ for recovering the regression function. In this design, infeasible Lasso, Square-root Lasso, and Iterated Lasso exhibit substantial bias toward zero. This bias is somewhat alleviated by choosing the penalty level via cross-validation, though the remaining bias is still substantial. It is also apparent that, as intuition and theory would suggest, the post-penalized estimators remove a large portion of this shrinkage bias. We see that among the feasible estimators, the best performing methods are the Post-Square-root Lasso and Post-Iterated Lasso. Interestingly, cross-validation also produces a Post-Lasso estimator that performs nearly as well, although the procedure is much more expensive computationally. The Post-Lasso estimators perform better than Lasso estimators primarily due to a much lower shrinkage bias, which is beneficial in the design considered.



                          High Noise (σ = 1)           Low Noise (σ = 0.1)
Estimator                 Bias     Prediction Error    Bias      Prediction Error
Lasso                     0.444    0.654               0.0487    0.0700
Post-Lasso                0.129    0.347               0.0054    0.0300
Square-root Lasso         0.526    0.770               0.0615    0.0870
Post-Square-root Lasso    0.187    0.364               0.0035    0.0238
Iterated Lasso            0.437    0.644               0.0477    0.0687
Post-Iterated Lasso       0.133    0.360               0.0056    0.0297
CV Lasso                  0.265    0.516               0.0233    0.0987
CV Post-Lasso             0.148    0.415               0.0035    0.0237
Oracle                    0.035    0.238               0.0035    0.0237



TABLE 2. The table displays the mean bias and the mean prediction error. The average 
number of components selected by Lasso was 5.18 in the high noise case and 6.44 in the low 
noise case. In the case of CV Lasso, the average size of the model was 29.6 in the high noise 
case and 10.0 in the low noise case. Finally, the CV Post-Lasso selected models with average 
size of 7.1 in the high noise case and 6.0 in the low noise case. 



5. Inference on Structural Effects with High-Dimensional Instruments 

5.1. Methods and Theoretical Results. In this section, we consider the linear instrumental variable (IV) model with many instruments. Consider the Gaussian simultaneous equation model:

$$y_{1i} = y_{2i}\alpha_1 + w_i'\alpha_2 + \zeta_i, \qquad (5.35)$$

$$y_{2i} = f(z_i) + v_i, \qquad (5.36)$$

$$\begin{pmatrix}\zeta_i \\ v_i\end{pmatrix} \Big|\, z_i \sim N\left(0, \begin{pmatrix}\sigma_\zeta^2 & \sigma_{\zeta v} \\ \sigma_{\zeta v} & \sigma_v^2\end{pmatrix}\right). \qquad (5.37)$$

Here $y_{1i}$ is the response variable, $y_{2i}$ is the endogenous variable, $w_i$ is a $k_w$-vector of control variables, $z_i = (u_i', w_i')'$ is a vector of instrumental variables (IV), and $(\zeta_i, v_i)$ are disturbances that are independent of $z_i$. The function $f(z_i) = \mathrm{E}[y_{2i}|z_i]$, the optimal instrument, is an unknown, potentially complicated function of the elementary instruments $z_i$. The main parameter of interest is the coefficient on $y_{2i}$, whose true value is $\alpha_1$. We treat $\{z_i\}$ as fixed throughout.
Based on these elementary instruments, we create a high-dimensional vector of technical instruments, $x_i = P(z_i)$, with dimension $p$ possibly much larger than the sample size though restricted via conditions stated below. We then estimate the optimal instrument $f(z_i)$ by

$$\hat f(z_i) = x_i'\hat\beta, \qquad (5.38)$$

where $\hat\beta$ is a feasible Lasso or Post-Lasso estimator as formally defined in the previous section.

Sparse methods take advantage of approximate sparsity and ensure that many elements of $\hat\beta$ are zero when $p$ is large. In other words, sparse methods will select a small subset of the available technical instruments. Let $A_i = (f(z_i), w_i')'$ be the ideal instrument vector, and let

$$\hat A_i = (\hat f(z_i), w_i')' \qquad (5.39)$$

be the estimated instrument vector. Denoting $d_i = (y_{2i}, w_i')'$, we form the feasible IV estimator using the estimated instrument vector as

$$\hat\alpha^* = \left(\mathrm{E}_n[\hat A_i d_i']\right)^{-1}\left(\mathrm{E}_n[\hat A_i y_{1i}]\right). \qquad (5.40)$$
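Once fitted first-stage values are in hand, the estimator in (5.40) is a single linear solve. The sketch below assumes the fitted values $\hat f(z_i)$ have already been produced by some Lasso or Post-Lasso routine and are passed in as `f_hat`; the function name `iv_with_estimated_instrument` is hypothetical and the code is an illustration, not the authors' implementation.

```python
import numpy as np

def iv_with_estimated_instrument(y1, y2, W, f_hat):
    """IV estimate using the instrument vector (f_hat(z_i), w_i')'.

    y1: response (n,), y2: endogenous regressor (n,),
    W: controls (n, k_w), f_hat: fitted optimal instrument (n,).
    Returns the coefficient vector on d_i = (y_2i, w_i')'.
    """
    A_hat = np.column_stack([f_hat, W])     # estimated instruments
    D = np.column_stack([y2, W])            # regressors d_i
    n = len(y1)
    # Solve E_n[A_hat d'] alpha = E_n[A_hat y1].
    return np.linalg.solve(A_hat.T @ D / n, A_hat.T @ y1 / n)
```

The first entry of the returned vector is the estimate of $\alpha_1$; the remaining entries are the coefficients on the controls.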



The main regularity condition is recorded as follows. 



Condition ASIV. In the linear IV model (5.35)-(5.37) with technical instruments $x_i = P(z_i)$, the following assumptions hold: (i) the parameter values $\sigma_v$, $\sigma_\zeta$ and the eigenvalues of $Q_n = \mathrm{E}_n[A_i A_i']$ are bounded away from zero and from above uniformly in $n$, (ii) condition ASM holds for (5.36), namely for each $i = 1, \ldots, n$, there exists $\beta_0 \in \mathbb{R}^p$, such that $f(z_i) = x_i'\beta_0 + r_i$, $\|\beta_0\|_0 \leq s$, $\{\mathrm{E}_n[r_i^2]\}^{1/2} \leq K\sigma_v\sqrt{s/n}$, where the constant $K$ does not depend on $n$, (iii) condition SE holds for $\mathrm{E}_n[x_i x_i']$, and (iv) $s^2\log^2(p \vee n) = o(n)$.

The main inference result is as follows. 

Theorem 3 (Asymptotic Normality for IV Estimator Based on Lasso and Post-Lasso). Suppose Condition ASIV holds. The IV estimator constructed in (5.40) is $\sqrt{n}$-consistent and is asymptotically efficient, namely as $n$ grows:

$$(\sigma_\zeta^2 Q_n^{-1})^{-1/2}\sqrt{n}(\hat\alpha^* - \alpha) = N(0, I) + o_P(1),$$

and the result also holds with $Q_n$ replaced by $\hat Q_n = \mathrm{E}_n[\hat A_i \hat A_i']$ and $\sigma_\zeta^2$ by $\hat\sigma_\zeta^2 = \mathrm{E}_n[(y_{1i} - d_i'\hat\alpha^*)^2]$.

Comment 5.1. The theorem shows that the IV estimator based on estimating the first stage with Lasso or Post-Lasso is asymptotically as efficient as the infeasible optimal IV estimator that uses $A_i$ and thus achieves the semi-parametric efficiency bound of Chamberlain (1987). Belloni, Chernozhukov, and Hansen (2010) show that the result continues to hold when other sparse methods are used to estimate the optimal instruments. The sufficient conditions for showing that the IV estimator obtained using sparse methods to estimate the optimal instruments is asymptotically efficient include a set of technical conditions and the following key growth condition: $s^2\log^2(p \vee n) = o(n)$. This rate condition requires the optimal instruments to be sufficiently smooth so that a relatively small number of series terms can be used to approximate them well. This smoothness ensures that the impact of instrument estimation on the IV estimator is asymptotically negligible. The rate condition $s^2\log^2(p \vee n) = o(n)$ can be substantive and cannot be substantially weakened for the full-sample IV estimator considered above. However, we can replace this condition with the weaker condition that $s\log(p \vee n) = o(n)$ by employing a sample splitting method from the many instruments literature (Angrist and Krueger 1995), as established in Belloni, Chernozhukov, and Hansen (2010) and Belloni, Chen, Chernozhukov, and Hansen (2010). Moreover, Belloni, Chen, Chernozhukov, and Hansen (2010) show that the result of the theorem, with some appropriate modifications, continues to apply under heteroscedasticity, though the estimator does not necessarily attain the semi-parametric efficiency bound. In order to achieve full efficiency allowing for heteroscedasticity, we would need to estimate the conditional variance of the structural disturbances in the second stage equation. In principle, this estimation could be done using sparse methods. □

5.2. Weak Identification Robust Inference with Very Many Instruments. Consider the simultaneous equation model:

$$y_{1i} = y_{2i}\alpha_1 + w_i'\alpha_2 + \zeta_i, \qquad \zeta_i \,|\, z_i \sim N(0, \sigma_\zeta^2), \qquad (5.41)$$

where $y_{1i}$ is the response variable, $y_{2i}$ is the endogenous variable, $w_i$ is a $k_w$-vector of control variables, $z_i = (u_i', w_i')'$ is a vector of instrumental variables (IV), and $\zeta_i$ is a disturbance that is independent of $z_i$. We treat $\{z_i\}$ as fixed throughout.

We would like to use a high-dimensional vector $x_i = P(z_i)$ of technical instruments for inference that is robust to weak identification. We propose a method for inference based on inverting pointwise tests performed using a sup-score statistic defined below. The procedure is similar in spirit to Anderson and Rubin (1949) and Staiger and Stock (1997) but uses a very different statistic that is well-suited to cases with very many instruments.

In order to formulate the sup-score statistic, we first partial out the effect of controls $w_i$ on the key variables. For an $n$-vector $\{u_i, i = 1, \ldots, n\}$, define $\tilde u_i = u_i - w_i'\mathrm{E}_n[w_i w_i']^{-1}\mathrm{E}_n[w_i u_i]$, i.e. the residuals left after regressing this vector on $\{w_i, i = 1, \ldots, n\}$. Hence $\tilde y_{1i}$, $\tilde y_{2i}$, and $\tilde x_{ij}$ are residuals obtained by partialling out controls. Also, let $\tilde x_i = (\tilde x_{i1}, \ldots, \tilde x_{ip})'$. In this formulation, we omit elements of $w_i$ from $x_{ij}$ since they are eliminated by partialling out. We then normalize without loss of generality

$$\mathrm{E}_n[\tilde x_{ij}^2] = 1, \quad j = 1, \ldots, p. \qquad (5.42)$$

The sup-score statistic for testing the hypothesis $\alpha_1 = a$ takes the form:

$$\Lambda_a = \max_{1 \leq j \leq p} \frac{|n\mathrm{E}_n[(\tilde y_{1i} - \tilde y_{2i}a)\tilde x_{ij}]|}{\sqrt{\mathrm{E}_n[(\tilde y_{1i} - \tilde y_{2i}a)^2\tilde x_{ij}^2]}}.$$

If the hypothesis $\alpha_1 = a$ is true, then the critical value for achieving level $\gamma$ is

$$\Lambda(1 - \gamma|W, X) = (1-\gamma)\text{-quantile of } \max_{1 \leq j \leq p} \frac{|n\mathrm{E}_n[\tilde g_i \tilde x_{ij}]|}{\sqrt{\mathrm{E}_n[\tilde g_i^2 \tilde x_{ij}^2]}} \text{ conditional on } W, X, \qquad (5.43)$$
where $W = [w_1, \ldots, w_n]'$, $X = [x_1, \ldots, x_n]'$, and $g_1, \ldots, g_n$ are i.i.d. $N(0,1)$ variables independent of $W$ and $X$; $\tilde g_i$ denotes the residuals left after projecting $\{g_i\}$ on $\{w_i\}$ as defined above. We can approximate the critical value $\Lambda(1 - \gamma|W, X)$ by simulation conditional on $X$ and $W$. It is also possible to use a simple asymptotic bound on this critical value of the form

$$\Lambda(1 - \gamma) := c\sqrt{n}\,\Phi^{-1}(1 - \gamma/2p) \leq c\sqrt{2n\log(2p/\gamma)}, \qquad (5.44)$$

for $c > 1$. The finite-sample $(1-\gamma)$-confidence region for $\alpha_1$ is then given by

$$\mathcal{C} := \{a \in \mathbb{R} : \Lambda_a \leq \Lambda(1 - \gamma|W, X)\},$$

while a large sample $(1-\gamma)$-confidence region is given by $\mathcal{C}' := \{a \in \mathbb{R} : \Lambda_a \leq \Lambda(1 - \gamma)\}$.
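The construction of the finite-sample region can be sketched as follows. The sketch reads the sup-score statistic as $\max_j |n\mathrm{E}_n[e_i\tilde x_{ij}]| / \sqrt{\mathrm{E}_n[e_i^2\tilde x_{ij}^2]}$ with $e_i = \tilde y_{1i} - \tilde y_{2i}a$, approximates the critical value by simulating Gaussian draws conditional on $W$ and $X$, and inverts the test over a user-supplied grid of candidate values. The function names, the grid, and the number of simulation draws are all assumptions made for illustration.

```python
import numpy as np

def partial_out(V, W):
    """Residuals from the least squares projection of V (vector or matrix) on W."""
    coef, *_ = np.linalg.lstsq(W, V, rcond=None)
    return V - W @ coef

def sup_score_stat(e, Xt):
    """max_j |n E_n[e_i x_ij]| / sqrt(E_n[e_i^2 x_ij^2])."""
    num = np.abs(Xt.T @ e)                              # n * |E_n[e_i x_ij]|
    den = np.sqrt(((e[:, None] * Xt) ** 2).mean(axis=0))
    return (num / den).max()

def sup_score_region(y1, y2, X, W, grid, gamma=0.05, n_sim=1000, seed=0):
    """Grid points a with sup-score statistic below the simulated critical value."""
    rng = np.random.default_rng(seed)
    y1t, y2t, Xt = partial_out(y1, W), partial_out(y2, W), partial_out(X, W)
    Xt = Xt / np.sqrt((Xt ** 2).mean(axis=0))           # normalize E_n[x_ij^2] = 1
    n = len(y1)
    # Simulate the statistic with residualized N(0,1) draws in place of the error.
    draws = [sup_score_stat(partial_out(rng.standard_normal(n), W), Xt)
             for _ in range(n_sim)]
    crit = np.quantile(draws, 1.0 - gamma)
    keep = np.array([sup_score_stat(y1t - a * y2t, Xt) <= crit for a in grid])
    return np.asarray(grid)[keep], crit
```

Since the statistic is invariant to rescaling of $e$, the simulated critical value does not depend on the unknown error variance, which is what allows simulation with standard normal draws.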
The main regularity condition is recorded as follows. 

Condition HDIV. Suppose the linear IV model (5.41) holds. Consider the $p$-vector of instruments $x_i = P(z_i)$, $i = 1, \ldots, n$, such that $(\log p)/n \to 0$. Suppose further that the following assumptions hold uniformly in $n$: (i) the parameter value $\sigma_\zeta$ is bounded away from zero and from above, (ii) the dimension of $w_i$ is bounded and the eigenvalues of the Gram matrix $\mathrm{E}_n[w_i w_i']$ are bounded away from zero, (iii) $\|w_i\| \leq K$ and $|x_{ij}| \leq K$ for all $1 \leq i \leq n$ and all $1 \leq j \leq p$, where $K$ is a constant, independent of $n$.

The main inference result is as follows. 

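Theorem 4 (Valid Inference based on the Sup-Score Statistic). (1) Suppose the linear IV model (5.41) holds. Then $P(\alpha_1 \in \mathcal{C}) = 1 - \gamma$. (2) Suppose further that condition HDIV holds, then $P(\alpha_1 \in \mathcal{C}') \geq 1 - \gamma - o(1)$. (3) Moreover, if $a$ is such that

$$\max_{1 \leq j \leq p} \frac{|a - \alpha_1|\sqrt{n}\,|\mathrm{E}_n[\tilde y_{2i}\tilde x_{ij}]|/\sqrt{\log p}}{\sigma_\zeta + |a - \alpha_1|\sqrt{\mathrm{E}_n[\tilde y_{2i}^2\tilde x_{ij}^2]}} \to \infty,$$

then $P(a \in \mathcal{C}) = o(1)$ and $P(a \in \mathcal{C}') = o(1)$.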

Comment 5.2. The theorem shows that the confidence regions $\mathcal{C}$ and $\mathcal{C}'$ constructed above have finite-sample and large sample validity, respectively. Moreover, the probability of including a false point $a$ in either $\mathcal{C}$ or $\mathcal{C}'$ tends to zero as long as $a$ is sufficiently distant from $\alpha_1$ and instruments are not too weak. In particular, if there is a strong instrument, the confidence regions will eventually exclude points $a$ that are further than $\sqrt{(\log p)/n}$ away from $\alpha_1$. Moreover, if there are instruments whose correlation with the endogenous variable is of greater order than $\sqrt{(\log p)/n}$, then the confidence regions will asymptotically be bounded. Finally, note that a nice feature of the construction is that it provides provably valid confidence regions and does not require computation of combinatorial quantities, in sharp contrast to other recent proposals for inference, e.g. Gautier and Tsybakov (2011). Lastly, we note that it is not difficult to generalize the results to allow for an increasing number of controls $w_i$ under suitable technical conditions that restrict the number of controls and their envelope in relation to the sample size. Here we did not consider this possibility in order to highlight the impact of very many instruments more clearly. The result (2) extends to non-Gaussian, heteroscedastic cases; we refer to Belloni, Chen, Chernozhukov, and Hansen (2010) for relevant details. □

Comment 5.3 (Inverse Lasso Interpretation). The construction of confidence regions above can be given the following Inverse Lasso interpretation. Let

$$\hat\beta_a = \arg\min_{\beta \in \mathbb{R}^p} \mathrm{E}_n[((\tilde y_{1i} - a\tilde y_{2i}) - \tilde x_i'\beta)^2] + \frac{\lambda}{n}\sum_{j=1}^p |\beta_j|\gamma_{aj}, \qquad \gamma_{aj} = \sqrt{\mathrm{E}_n[(\tilde y_{1i} - \tilde y_{2i}a)^2\tilde x_{ij}^2]}.$$

If $\lambda = 2\Lambda(1 - \gamma|W, X)$, then $\mathcal{C}$ is equivalent to the region $\{a \in \mathbb{R} : \hat\beta_a = 0\}$. If $\lambda = 2\Lambda(1 - \gamma)$, then $\mathcal{C}'$ is equivalent to the region $\{a \in \mathbb{R} : \hat\beta_a = 0\}$. In words, to construct these confidence regions, we collect all potential values of the structural parameter for which the Lasso regression of the potential structural disturbance on the instruments yields zero coefficients on the instruments. This idea is akin to the Inverse Quantile Regression and Inverse Least Squares ideas in Chernozhukov and Hansen (2008a) and Chernozhukov and Hansen (2008b). □



5.3. Monte Carlo Example: Instrumental Variable Model. The theoretical results pre- 
sented in the previous sections suggest that using Lasso to aid in fitting the first-stage regression 
should result in IV estimators with good estimation and inference properties. In this section, 
we provide simulation evidence on these properties of IV estimators using iterated Lasso to 
select instrumental variables for a second-stage estimator. We also considered Square-root 
Lasso for variable selection. The results were similar to those for iterated Lasso, so we report 
only the iterated Lasso results. 

Our simulations are based on a simple instrumental variables model of the form

$$y_{1i} = y_{2i}\alpha + \zeta_i, \qquad y_{2i} = x_i'\Pi + v_i, \qquad \begin{pmatrix}\zeta_i \\ v_i\end{pmatrix} \sim N\left(0, \begin{pmatrix}\sigma_\zeta^2 & \sigma_{\zeta v} \\ \sigma_{\zeta v} & \sigma_v^2\end{pmatrix}\right),$$

where $\alpha = 1$ is the parameter of interest, and $x_i = (x_{i1}, \ldots, x_{i100})' \sim N(0, \Sigma_X)$ is the instrument vector with $\mathrm{E}[x_{ih}^2] = \sigma_x^2$ and $\mathrm{Corr}(x_{ih}, x_{ij}) = .5^{|j-h|}$. In all simulations, we set $\sigma_\zeta^2 = 1$ and $\sigma_x^2 = 1$. We also use $\mathrm{Corr}(\zeta, v) = .3$.

We consider several different settings for the other parameters. We provide simulation results for sample sizes, $n$, of 100 and 500. In one simulation design, we set $\Pi = 0$ and $\sigma_v^2 = 1$. In this case, the instruments have no information about the endogenous variable, so $\alpha$ is unidentified. We refer to this as the "No Signal" design. In the remaining cases, we use an "exponential" design for the first stage coefficients, $\Pi$, that sets the coefficient on $x_{ih}$ equal to $.7^{h-1}$ for $h = 1, \ldots, 100$ to provide an example of Lasso's performance in settings where the instruments are informative. This model is approximately sparse, since the majority of explanatory power is contained in the first few instruments, and obeys the regularity conditions put forward above. We consider values of $\sigma_v^2$ which are chosen to benchmark three different strengths of instruments. The three values of $\sigma_v^2$ are found as $\sigma_v^2 = \frac{n\Pi'\Sigma_X\Pi}{F^*\Pi'\Pi}$ for $F^*$ of 10, 40, or 160.
For each setting of the simulation parameter values, we report results from several estimation procedures. A simple possibility when presented with $p < n$ instrumental variables is to just estimate the model using 2SLS and all of the available instruments. It is well-known that this will result in poor finite-sample properties unless there are many more observations than instruments; see, for example, Bekker (1994). Fuller's (1977) estimator (FULL) is robust to many instruments as long as the presence of many instruments is accounted for when constructing standard errors and $p < n$; see Bekker (1994) and Hansen, Hausman, and Newey (2008) for example. We report results for these estimators in rows labeled 2SLS(All) and FULL(All) respectively. In addition, we report Fuller and IV estimates based on the set of instruments selected by Lasso with two different penalty selection methods. IV-Lasso and FULL-Lasso are respectively 2SLS and Fuller using instruments selected by Lasso with penalty obtained using the iterated method outlined in Appendix A. We use an initial estimate of the noise level obtained using the regression of $y_2$ on the instrument that has the highest simple correlation with $y_2$. IV-Lasso-CV and FULL-Lasso-CV are respectively 2SLS and Fuller using instruments selected by Lasso using 10-fold cross-validation to choose the penalty level. We also report inference results based on the Sup-Score test developed in Section 5.2.

In Table 3 we report root-mean-squared-error (RMSE), median bias (Med. Bias), rejection frequencies for 5% level tests (rp(.05)), and the number of times the Lasso-based procedures select no instruments ($\|\hat\Pi\|_0 = 0$). For computing rejection frequencies, we estimate conventional 2SLS standard errors for all 2SLS estimators, and the many instrument robust standard errors of Hansen, Hausman, and Newey (2008) for the Fuller estimators. In cases where Lasso selects no instruments, the reported Lasso point estimation properties are based on the feasible procedure that enforces identification by lowering the penalty until one variable is selected. Rejection frequencies in cases where no instruments are selected are based on the feasible procedure that uses conventional IV inference using the selected instruments when this set is non-empty and otherwise uses the Sup-Score test.

The simulation results show that Lasso-based IV estimators are useful in situations with many instruments. As expected, 2SLS(All) does extremely poorly along all dimensions. FULL(All) also performs worse than the Lasso-based estimators in terms of estimator risk (RMSE) in all cases. The Lasso-based procedures do not dominate FULL(All) in terms of median bias, though all of the Lasso-based procedures have smaller median bias than FULL(All) when $n = 100$ and there is some signal in the instruments and are very similar with $n = 500$. In terms of size of 5% level tests, we see that the Sup-Score test uniformly controls size as indicated by the theory. IV-Lasso and FULL-Lasso using the iterated penalty selection method also do a very good job controlling size across all of the simulation settings, with a worst-case rejection frequency of .064 (with simulation standard error of .01) and the majority of rejection frequencies below .05. Interestingly, when there is no signal in the instruments, the Lasso-based estimators using penalty selected by CV have substantial size distortions when $n = 100$, which is due to the CV penalty being small enough that instruments are still selected despite there being no signal. The iterated penalty is such that, at least approximately, only instruments whose coefficients are outside of a $\sqrt{n}$ neighborhood of 0 are selected, and thus overselection in cases with little signal is guarded against. Despite the problem with using CV when there is no signal, it is worth noting that the Lasso-based procedures with CV penalty produce tests with approximately correct size in all other parameter settings.

Footnote (Fuller estimator): The Fuller estimator requires a user-specified parameter. We set this parameter equal to one, which produces a higher-order unbiased estimator. See Hahn, Hausman, and Kuersteiner (2004) for additional discussion.

Footnote (2SLS(All) and FULL(All)): All models include an intercept. With $n = 100$, we randomly select 98 instruments to use for 2SLS(All) and FULL(All).

To further examine the properties of the inference procedures that appear to give small size distortions, we plot the power curves of 5% level tests using the Sup-Score test and IV-Lasso with the iterated and CV penalty choices with $n = 100$ in Figure 2. We see that both the Sup-Score test and IV-Lasso using the iterated procedure augmented with the Sup-Score test when no instruments are selected appear to uniformly control size and have some power against alternatives when the model is identified. It is also clear that of these two procedures, the IV-Lasso has substantially more power than the Sup-Score test. The figures also show that IV-Lasso with iterated penalty has almost as much power as IV-Lasso using the CV penalty while avoiding the substantial size distortion and spurious power produced by using CV when there is no signal.

Overall, the simulation results are favorable to the Lasso-based IV methods. The Lasso- 
based estimators dominate the other estimators considered based on RMSE and have relatively 
small finite sample biases. The Lasso-based procedures also do a good job in producing tests 
with size close to the nominal level. There is some evidence that the Fuller-Lasso may do better 
than 2SLS-Lasso in terms of testing performance though these procedures are very similar in 
the designs considered. It also seems that tests based on IV-Lasso using the iterated penalty 
selection rule may perform better than tests based on IV-Lasso using cross-validation to choose 
the Lasso penalty levels, especially when there is little explanatory power in the instruments. 



6. Inference on Treatment and Structural Effects Conditional on Observables

6.1. Methods and Theoretical Results. We consider the following partially linear model,

$$y_i = d_i\alpha_0 + g(z_i) + \zeta_i, \qquad (6.45)$$

$$d_i = m(z_i) + v_i, \qquad (6.46)$$

where $d_i$ is a policy/treatment variable whose impact we would like to infer, and $z_i$ represents confounding factors on which we need to condition. This model is of interest in our international growth example discussed in the next section as well as in many empirical studies (Heckman, LaLonde, and Smith 1999, Imbens 2004). The confounding factors affect the policy variable via $m(z_i)$. We assume that $m(z_i)$ and $g(z_i)$ each admit an approximately sparse form and use linear combinations of technical control terms $x_i = P(z_i)$ to approximate them.

Footnote (Figure 2): The power curves in the $n = 500$ case are qualitatively similar.

Instrumental Variables Model Simulation Results

Estimator        RMSE     Med. Bias   rp(.05)      RMSE     Med. Bias   rp(.05)
IV-Lasso         0.049    0.005       0.064        0.022    0.002       0.044
FULL-Lasso       0.049    0.002       0.056        0.022    0.001       0.040
IV-Lasso-CV      0.048    0.006       0.054        0.022    0.002       0.040
FULL-Lasso-CV    0.049    0.003       0.048        0.022    0.000       0.038
Sup-Score                             0.004                             0.010

Table 3. Results are based on 500 simulation replications. F* measures the strength of the instruments as outlined in the text. We report root-mean-square-error (RMSE), median bias (Med. Bias), rejection frequency for 5% level tests (rp(.05)), and the number of times the Lasso-based procedures select no instruments ($\|\hat\Pi\|_0 = 0$). Further details are provided in the text.
[Figure 2 panels: "n = 100, No Signal" and "n = 100, F = 10"]

FIGURE 2. Power curves for Sup-Score test, IV-Lasso with Iterated penalty, and
IV-Lasso with penalty selected by 10-Fold Cross-Validation from IV simulation
with 100 observations.



There are at least three obvious strategies for inference: 

(i) Estimate $\alpha_0$ by applying a Feasible Lasso method to model (6.45) without penalizing
$\alpha_0$,

(ii) Estimate $\alpha_0$ by applying a Post-Lasso method to model (6.45) without penalizing $\alpha_0$,

(iii) Estimate $\alpha_0$ by applying an Indirect Post-Lasso, where $\alpha_0$ is estimated by running a
standard least squares regression of $y_{1i}$ on $d_i$ and the control terms selected in a preliminary
Feasible Lasso regression of $d_i$ on $x_i$ in (6.46).

Note that it is most natural not to penalize $\alpha_0$ since the goal is to quantify the impact of
$d_i$. (The previous rate results derived in Theorems 1 and 2 for the regression function extend
to the case where the coefficients on a fixed number of variables are not penalized.) In what
follows, we shall refer to options (i), (ii), and (iii) respectively as Lasso, Post-Lasso, and Indirect
Post-Lasso.

Regarding inference, "intuition" suggests that if $g$ can be estimated at faster than the $n^{1/4}$
rate then any of (i)-(iii) could be $\sqrt{n}$-consistent and asymptotically normal. It turns out that
this "intuition" is often correct for options (ii) and (iii) but is wrong for option (i). Indeed, it
is possible to show that, under rather strong regularity conditions,

    $(\sigma_\zeta^2 [\mathbb{E}_n v_i^2]^{-1})^{-1/2} \sqrt{n} (\hat\alpha - \alpha_0) = N(0, 1) + o_P(1)$,    (6.48)

where $\sigma_\zeta^2 [\mathbb{E}_n v_i^2]^{-1}$ is the semi-parametric efficiency bound for estimating $\alpha_0$, for $\hat\alpha$ denoting the
estimators (ii) or (iii) above. Unfortunately, the distributional result (6.48) is not very robust
to modest violations of regularity conditions and may provide a poor approximation to the
finite-sample distributions of the estimators for $\alpha_0$. The reason is that Lasso applied to (6.45)
may miss important terms relating $d_i$ to $z_i$ through $m(z_i)$ and thus suffer from substantial
omitted variables bias. On the other hand, Lasso applied only to (6.46), even if successful in
selecting adequate controls for the relationship between $d_i$ and $z_i$, may miss important terms in
$g(z_i)$ and thus be highly inefficient. We illustrate this lack of robustness through a simulation
experiment reported below.

Instead of using Lasso, Post-Lasso, or Indirect Post-Lasso, we advocate a "double-Post-Lasso"
method. To define this estimator, we write the reduced form corresponding to (6.45)-(6.46):

    $y_{1i} = \alpha_0 m(z_i) + g(z_i) + \alpha_0 v_i + \zeta_i$,    (6.49)
    $d_i = m(z_i) + v_i$.    (6.50)

Now we have two equations and hence can apply Lasso methods to each equation to select
control terms. That is, we run a Lasso regression of $y_{1i}$ on $x_i = P(z_i)$ and a Lasso regression of
$d_i$ on $x_i = P(z_i)$. Then we can run least squares of $y_{1i}$ on $d_i$ and the union of the controls
selected in each equation to estimate and perform inference on $\alpha_0$. By using this procedure we
increase the chances of successfully recovering terms that approximate the key control term
$m(z_i)$, which results in improved robustness properties. Indeed, the resulting procedure is
considerably more robust in computational experiments and requires much weaker regularity
conditions than the obvious strategies outlined above.

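Now we formally define the double-Post-Lasso estimator. Let $\hat I_1 = \mathrm{support}(\hat\beta_1)$ denote
the control terms selected by a feasible Lasso estimator $\hat\beta_1$ computed using data $(y_i, x_i) =
(d_i, x_i)$, $i = 1, \dots, n$. Let $\hat I_2 = \mathrm{support}(\hat\beta_2)$ denote the control terms selected by a feasible
Lasso estimator $\hat\beta_2$ computed using data $(y_i, x_i) = (y_{1i}, x_i)$, $i = 1, \dots, n$. The double-Post-Lasso
estimator $\check\alpha$ of $\alpha_0$ is defined as the least squares estimator obtained by regressing $y_{1i}$ on $d_i$ and
the selected control terms $x_{ij}$ with $j \in \hat I \supseteq \hat I_1 \cup \hat I_2$:

    $(\check\alpha, \check\beta) = \arg\min_{\alpha \in \mathbb{R}, \beta \in \mathbb{R}^p} \{ \mathbb{E}_n[(y_{1i} - d_i \alpha - x_i'\beta)^2] : \beta_j = 0, \forall j \notin \hat I \}$.

The set $\hat I$ can contain other variables with names $\hat I_3$ that the analyst may think are important
for ensuring robustness. Thus, $\hat I = \hat I_1 \cup \hat I_2 \cup \hat I_3$; let $\hat s = |\hat I|$ and $\hat s_j = |\hat I_j|$ for $j = 1, 2, 3$.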

Condition ASTE. (i) The data $(y_{1i}, d_i, z_i)$, $i = 1, \dots, n$, obey model (6.45)-(6.47) for each
$n$, and $x_i = P(z_i)$ is a dictionary of transformations of $z_i$. (ii) The parameter values $\sigma_v$ and $\sigma_\zeta$
are bounded from above by $\bar\sigma$ and away from zero, uniformly in $n$, and $|\alpha_0|$ is bounded uniformly
in $n$. (iii) Regressor values $x_i$, $i = 1, \dots, n$, obey the normalization condition $\mathbb{E}_n[x_{ij}^2] = 1$ for
all $j \in \{1, \dots, p\}$ and sparse eigenvalue condition SE. (iv) There exist $s \geq 1$ and $\beta_{m0}$ and $\beta_{g0}$
such that

    $m(z_i) = x_i'\beta_{m0} + r_{mi}$,  $\|\beta_{m0}\|_0 \leq s$,  $\{\mathbb{E}_n[r_{mi}^2]\}^{1/2} \leq K \sigma_v \sqrt{s/n}$,    (6.51)
    $g(z_i) = x_i'\beta_{g0} + r_{gi}$,  $\|\beta_{g0}\|_0 \leq s$,  $\{\mathbb{E}_n[r_{gi}^2]\}^{1/2} \leq K \sigma_\zeta \sqrt{s/n}$,    (6.52)

where $K$ is an absolute constant, independent of $n$, but all other parameter values can depend
on $n$. (v) $s^2 \log^2(p \vee n) = o(n)$ and $\hat s_3 \lesssim 1 \vee \hat s_1 \vee \hat s_2$.

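Theorem 5 (Inference on Treatment Effects). Suppose condition ASTE holds. The double-Post-Lasso
estimator $\check\alpha$ obeys

    $(\sigma_\zeta^2 [\mathbb{E}_n v_i^2]^{-1})^{-1/2} \sqrt{n} (\check\alpha - \alpha_0) = N(0, 1) + o_P(1)$.

Moreover, the result continues to apply if $\sigma_\zeta^2$ is replaced by $\hat\sigma_\zeta^2 = \mathbb{E}_n[(y_{1i} - d_i\check\alpha - x_i'\check\beta)^2](n/(n - \hat s - 1))$
and $\mathbb{E}_n[v_i^2]$ by $\mathbb{E}_n[\hat v_i^2] = \min_{\beta \in \mathbb{R}^p} \{\mathbb{E}_n[(d_i - x_i'\beta)^2] : \beta_j = 0, \forall j \notin \hat I\}$.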
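To make the plug-in quantities in Theorem 5 concrete, the following sketch computes a standard error for the coefficient on $d_i$ from a given selected set. It is a minimal illustration under homoscedasticity; the function name `post_selection_se` and its arguments are hypothetical, and the selected index set `I` is taken as given.

```python
import numpy as np


def post_selection_se(y, d, X, I):
    """Point estimate and standard error for the coefficient on d after selection.

    Uses sigma_hat^2 = E_n[residual^2] * n / (n - s_hat - 1) and E_n[v_hat^2], where
    v_hat is the residual from least squares of d on the selected controls.
    """
    I = np.asarray(I, dtype=int)
    n = len(y)
    XI = X[:, I]
    Z = np.column_stack([d, XI])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    s_hat = XI.shape[1]
    sigma2_hat = np.mean((y - Z @ coef) ** 2) * n / (n - s_hat - 1)
    if s_hat > 0:
        g, *_ = np.linalg.lstsq(XI, d, rcond=None)
        v_hat = d - XI @ g
    else:
        v_hat = d                       # no controls selected: nothing to partial out
    ev2_hat = np.mean(v_hat ** 2)
    se = np.sqrt(sigma2_hat / ev2_hat / n)   # from (sigma^2 [E_n v^2]^{-1}) / n
    return float(coef[0]), float(se)
```

By the Frisch-Waugh-Lovell argument this coincides with the classical homoscedastic least squares standard error for the coefficient on `d` in the regression on the selected controls, so standard regression output can be used in practice.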

Comment 6.1. Theorem 5, derived by the second-named author, shows that the double-Post-Lasso
estimator asymptotically achieves the semi-parametric efficiency bound under a set of
technical conditions and the following key growth condition: $s^2 \log^2(p \vee n) = o(n)$. This rate
condition requires the conditional expectations to be sufficiently smooth so that a relatively
small number of series terms can be used to approximate them well. As in the case of the IV
estimator, this condition can be replaced with the weaker condition that $s \log(p \vee n) = o(n)$ by
employing a sample splitting method of Fan, Guo, and Hao (2011). This is done in a companion
paper, which also deals with a more general setup, covering non-Gaussian, heteroscedastic
disturbances (Belloni, Chernozhukov, and Hansen 2011). □

Comment 6.2. The post double selection estimator is formulated in response to the inferential
non-robustness of the post single selection procedures. The non-robustness of the
latter is in line with the uniformity/robustness critique developed by Pötscher (2009). The
post double selection procedure developed here is in part motivated as a constructive response
to this uniformity critique. The need for such a constructive response was stressed by Hansen
(2005). The goal here is to produce an inferential method which gives useful confidence intervals
that are as robust as possible. Indeed, this robustness is captured by the fact that Theorem
5 permits the data-generating process (dgp) to change with n, as explicitly stated in the
Notation section. Thus the conclusions of the theorem are valid for a wide variety of sequences of
dgps. While this construction partly addresses the uniformity critique, it does not
achieve "full" uniformity; that is, it does not achieve validity over all potential sequences of
dgps. We should not interpret this as a deficiency, however, if the potential sequences causing
invalidity are thought of as implausible or unlikely (see Giné and Nickl (2010)). Finally, it
would be desirable to have a useful procedure that is valid under all sequences of dgps, but
such a procedure does not exist. □
6.2. Monte Carlo Example: Partially Linear Models. In this section, we compare the
estimation strategies proposed above in the following model:

    $y_i = d_i \alpha_0 + x_i'\theta_g + \zeta_i$,  $\zeta_i \sim N(0, \sigma_\zeta^2)$,    (6.53)

where the covariates $x \sim N(0, \Sigma)$, $\Sigma_{kj} = (0.5)^{|j-k|}$, and

    $d_i = x_i'\theta_m + v_i$,  $v_i \sim N(0, \sigma_v^2)$,    (6.54)

with $\sigma_\zeta = \sigma_v = 1$ and $\sigma_{\zeta v} = 0$. The dimension $p$ of the covariates $x$ is 200, and the sample
size $n$ is 100. We set $\alpha_0 = 1$ and

    $\theta_g = (1, 1/2, 1/3, 1/4, 1/5, 0, 0, 0, 0, 0, 1, 1/2, 1/3, 1/4, 1/5, 0, \dots, 0)'$,
    $\theta_m = (1, 1/2, 1/3, 1/4, 1/5, 1/6, 1/7, 1/8, 1/9, 1/10, 0, \dots, 0)'$.

We set $\lambda$ according to the X-dependent rule with $1 - \gamma = .95$. For each repetition we draw
new $x$'s, $\zeta$'s and $v$'s.
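A data-generating routine for this design might look as follows. It is a sketch under assumptions: the coefficient vectors follow the pattern described above (five harmonic terms, five zeros, five more harmonic terms for the outcome equation; ten harmonic terms for the treatment equation), and the names `simulate_plm`, `theta_g` and `theta_m` are illustrative rather than taken from the authors' code.

```python
import numpy as np


def simulate_plm(n=100, p=200, alpha0=1.0, rng=None):
    """Draw one sample (y, d, X) from the partially linear design of (6.53)-(6.54)."""
    rng = np.random.default_rng() if rng is None else rng
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])    # Sigma_kj = 0.5^{|j-k|}
    chol = np.linalg.cholesky(Sigma)

    theta_g = np.zeros(p)
    theta_g[0:5] = 1.0 / np.arange(1, 6)                  # 1, 1/2, ..., 1/5
    theta_g[10:15] = 1.0 / np.arange(1, 6)                # repeated after five zeros
    theta_m = np.zeros(p)
    theta_m[0:10] = 1.0 / np.arange(1, 11)                # 1, 1/2, ..., 1/10

    X = rng.standard_normal((n, p)) @ chol.T              # rows are N(0, Sigma)
    v = rng.standard_normal(n)                            # sigma_v = 1
    zeta = rng.standard_normal(n)                         # sigma_zeta = 1, independent of v
    d = X @ theta_m + v
    y = alpha0 * d + X @ theta_g + zeta
    return y, d, X
```

Each Monte Carlo repetition would call `simulate_plm` with a fresh draw and pass the result to the estimators being compared.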

We summarize the inference performance of these methods in Table 4, which reports mean
bias, standard deviation, and rejection frequencies for 95% confidence intervals. As we had
expected, Lasso and Post-Lasso exhibit a large mean bias which dominates the estimation
error and results in poor performance of conventional inference methods. On the other hand,
the Indirect Post-Lasso has a small bias relative to estimation error but is substantially more
variable than double-Post-Lasso and produces a conservative test, that is, a test with size much
smaller than the nominal level. Notably, the double-Post-Lasso provides a rejection frequency
(0.054) that is close to the nominal 5% level and has the smallest mean bias among the feasible
estimators, with a standard deviation (0.111) close to that of the oracle (0.100).

Partial Linear Model Simulation Results

Estimator                  Mean Bias   Std. Dev.   rp(0.05)
Lasso                       0.644      0.093       1.000
Post-Lasso                  0.415      0.209       0.877
Indirect Post-Lasso         0.0908     0.194       0.004
Double selection           -0.0041     0.111       0.054
Double selection Oracle     0.0001     0.110       0.051
Oracle                     -0.0003     0.100       0.044

TABLE 4. Results are based on 1000 simulation replications of the partially linear model
(6.53), where p = 200 and n = 100. We report mean bias (Mean Bias), standard deviation (Std.
Dev.), and rejection frequency for 5% level tests (rp(0.05)) for the four estimators described in
Section 6.1.



7. Empirical Examples

In this section, we illustrate the performance of sparse methods in two empirical examples. In
the first, we revisit Angrist and Krueger's (1991) classic instrumental variables estimation
of the returns to schooling. In this example, there are many instruments which can potentially
be used in forming the IV estimator and there are concerns about the potential biases and
inferential problems introduced from using many instruments. Our results show that sparse
methods can be effectively used to alleviate these concerns. The second example concerns the
use of $\ell_1$-penalized methods to select control variables for growth regressions in which there are
many possible country-level controls relative to the number of countries. Using Square-root
Lasso to select control variables, we find that there is evidence in favor of the hypothesis of
convergence.

7.1. Angrist and Krueger Example with 1530 instruments. We consider the Angrist
and Krueger (1991) model

    $y_{1i} = \theta_1 y_{2i} + w_i'\gamma + \zeta_i$,  $E[\zeta_i | w_i, z_i] = 0$,
    $y_{2i} = z_i'\beta + w_i'\delta + v_i$,  $E[v_i | w_i, z_i] = 0$,

where $y_{1i}$ is the log(wage) of individual $i$, $y_{2i}$ denotes education, $w_i$ denotes a vector of control
variables, and $z_i$ denotes a vector of instrumental variables that affect education but do not
directly affect the wage. The data were drawn from the 1980 U.S. Census and consist of
329,509 men born between 1930 and 1939. In this example, $w_i$ is a set of 510 variables: a
constant, 9 year-of-birth dummies, 50 state-of-birth dummies, and 450 state-of-birth x
year-of-birth interactions. As instruments, we use three quarter-of-birth dummies and interactions
of these quarter-of-birth dummies with the set of state-of-birth and year-of-birth controls in
$w_i$, giving a total of 1530 potential instruments. Angrist and Krueger (1991) discuss the
endogeneity of schooling in the wage equation and provide an argument for the validity of
$z_i$ as instruments based on compulsory schooling laws and the shape of the life-cycle earnings
profile. We refer the interested reader to Angrist and Krueger (1991) for further details. The
coefficient of interest is $\theta_1$, which summarizes the causal impact of education on earnings.

There are two basic options for estimating $\theta_1$ that have been used in the literature: one uses
just the three basic quarter-of-birth dummies and the other uses 180 instruments corresponding
to the three quarter-of-birth dummies and their interactions with the 9 main effects for
year-of-birth and 50 main effects for state-of-birth. It is commonly held that using the set of 180
instruments results in 2SLS estimates of $\theta_1$ that have a substantial bias, while using just the
three quarter-of-birth dummies results in an estimator with smaller bias but a large variance;
see, e.g., Hansen, Hausman, and Newey (2008). Another approach uses the 180 instruments and
the Fuller estimator (Fuller 1977) (FULL) with an adjustment for the use of many instruments.
Of course, using sparse methods for the first-stage estimation offers another option that could
be used in place of any of the aforementioned approaches.
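The variable and instrument counts quoted in this example follow from simple bookkeeping, which the short check below reproduces. The variable names are illustrative only.

```python
# Counts of dummies as described in the text.
n_qob = 3     # quarter-of-birth dummies
n_yob = 9     # year-of-birth dummies
n_sob = 50    # state-of-birth dummies

# Controls w_i: constant + main effects + state-of-birth x year-of-birth interactions.
n_interactions = n_sob * n_yob                        # 450
n_controls = 1 + n_yob + n_sob + n_interactions       # 510

# Instrument sets: QOB dummies alone, QOB plus interactions with main effects only,
# and QOB plus interactions with all non-constant controls.
n_iv_small = n_qob                                    # 3
n_iv_medium = n_qob + n_qob * (n_yob + n_sob)         # 180
n_iv_full = n_qob + n_qob * (n_controls - 1)          # 1530
```

Equivalently, the full set is the three quarter-of-birth dummies multiplied through all 510 elements of the control vector, constant included.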

Table 5 presents estimates of the returns to schooling coefficient using 2SLS and FULL (see
footnote 13) and different sets of instruments. Given knowledge of the construction of the
instruments, the first three rows of the table correspond to the natural groupings of the
instruments into the three main quarter-of-birth effects, the three quarter-of-birth dummies and
their interactions with the 9 main effects for year-of-birth and 50 main effects for state-of-birth,
and the full set of 1530 potential instruments. The remaining two rows give results based on using
Lasso to select instruments with penalty level given by the simple plug-in rule in Section 3 or by
10-fold cross-validation. Using the plug-in rule, Lasso selects only the dummy for being born in the
fourth quarter; with the cross-validated penalty level, Lasso selects 12 instruments, which
include the dummy for being born in the third quarter, the dummy for being born in the fourth
quarter, and 10 interaction terms. The reported estimates are obtained using Post-Lasso.

13 We set the user-defined choice parameter in the Fuller estimator equal to one which results in a higher-order
unbiased estimator.

Estimates of the Returns to Schooling in the Angrist-Krueger Data

Number of       2SLS       2SLS         Fuller     Fuller
Instruments     Estimate   Std. Error   Estimate   Std. Error
3               0.1079     0.0196       0.1087     0.0200
180             0.0928     0.0097       0.1063     0.0143
1530            0.0712     0.0049       0.1019     0.0422
Lasso - Iterated
1               0.0862     0.0254
Lasso - 10-Fold Cross-Validation
12              0.0982     0.0137       0.0997     0.0139

Sup-Score/Inverse Lasso 95% Confidence Interval
Number of       Center     Quasi        Confidence
Instruments     of CI      Std. Error   Interval
3               0.100      0.0255       (0.05, 0.15)
180             0.110      0.0459       (0.02, 0.20)
1530            0.095      0.0689       (-0.04, 0.23)

TABLE 5. This table reports estimates of the returns-to-schooling parameter in the Angrist
and Krueger (1991) data for different sets of instruments. The columns 2SLS Estimate and 2SLS Std.
Error give the 2SLS point estimate and associated estimated standard error, and the columns
Fuller Estimate and Fuller Std. Error give the Fuller point estimate and associated estimated
standard error. We report Post-Lasso results based on instruments selected using the plug-in
penalty described in Section 3.1 (Lasso - Iterated) and based on instruments selected using a
penalty level chosen by 10-Fold Cross-Validation (Lasso - 10-Fold Cross-Validation). For the
Lasso-based results, Number of Instruments is the number of instruments selected by Lasso.

The results in Table 5 are interesting and quite favorable to the idea of using Lasso to do
variable selection for instrumental variables. It is first worth noting that with 180 or 1530
instruments, there are modest differences between the 2SLS and FULL point estimates that
theory as well as evidence in Hansen, Hausman, and Newey (2008) suggest are likely due to
bias induced by overfitting the 2SLS first stage, which may be large relative to precision. In
the remaining cases, the 2SLS and FULL estimates are all very close to each other, suggesting
that this bias is likely not much of a concern. This similarity between the two estimates
is reassuring for the Lasso-based estimates as it suggests that Lasso is working as it should
in avoiding overfitting of the first stage and thus keeping the bias of the second-stage estimator
relatively small.

For comparing standard errors, it is useful to remember that one can regard Lasso as a 
way to select variables in a situation in which there is no a priori information about which 
of the set of variables is important; i.e., Lasso does not use the knowledge that the three
quarter-of-birth dummies are the "main" instruments and so is selecting among 1530 a priori
"equal" instruments. Given this, it is again reassuring that Lasso with the more conservative
plug-in penalty selects the dummy for birth in the fourth quarter, which is the variable that
most cleanly satisfies Angrist and Krueger's (1991) argument for the validity of the instrument
set. With this instrument, we estimate the returns-to-schooling to be .0862 with an estimated 
standard error of .0254. The best comparison is FULL with 1530 instruments which also does 
not use any a priori information about the relevance of the instruments and estimates the 
returns-to-schooling as .1019 with a much larger standard error of .0422. One can be less 
conservative than the plug-in penalty by using cross-validation to choose the penalty level. 
In this case, 12 instruments are chosen producing a Fuller point estimate (standard error) of 
.0997 (.0139) or 2SLS point estimate (standard error) of .0982 (.0137). These standard errors 
are smaller than even the standard errors obtained using information about the likely ordering 
of the instruments given by using 3 or 180 instruments where FULL has standard errors of 
.0200 and .0143 respectively. That is, Lasso finds just 12 instruments that contain nearly all 
information in the first stage and, by keeping the number of instruments small, produces a 
2SLS estimate that likely has relatively small bias. We believe that these empirical results are 
reliable. In particular, we note that the first stage F statistic on the selected 12 instruments is 
approximately 20; our computational experiments in the previous section employ designs with 
F = 10 and F = 40 to show that this method works well for both estimation and inference 
purposes. 

As a final check, we report the 95% confidence interval obtained from the Sup-Score test of 
Section 5.2 based on the three natural groupings of 3, 180, and 1530 instruments. This test is 
robust to weak or non-identification and is simple to implement. For the three different sets
of instruments, we obtain intervals that are much wider but roughly in line with the intervals 
discussed above. We note that our preferred method from the simulation section only makes 
use of the Sup-Score test when no instruments are selected, does a good job at controlling size 
in the simulation, and is more powerful than the Sup-Score test when the instruments contain 
signal about the endogenous variable. Using this procedure would lead us to use the much 
more precise IV-Lasso results. 
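The Sup-Score panel of Table 5 reports a center, a "Quasi Std. Error", and an interval for each instrument set. The table does not define the quasi standard error; the check below assumes it is the interval half-width divided by the normal critical value 1.96, and verifies that this reading reproduces the reported intervals to two decimals. The names used are illustrative.

```python
Z_975 = 1.96  # assumed two-sided 95% normal critical value

# (center of CI, quasi std. error) as reported in Table 5, keyed by number of instruments
reported = {3: (0.100, 0.0255), 180: (0.110, 0.0459), 1530: (0.095, 0.0689)}


def implied_interval(center, quasi_se, z=Z_975):
    """Interval center +/- z * quasi_se, rounded to two decimals as in the table."""
    return (round(center - z * quasi_se, 2), round(center + z * quasi_se, 2))


intervals = {k: implied_interval(c, se) for k, (c, se) in reported.items()}
```

Under this assumption the quasi standard errors can be compared directly with the 2SLS and Fuller standard errors in the upper panel, which makes the loss of precision from the Sup-Score approach easy to see.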

Overall, these results demonstrate that Lasso instrument selection is feasible and produces
sensible and what appear to be relatively high-quality estimates in this application. The
results from the Lasso-based IV estimators are similar to those obtained from other leading
approaches to estimation and inference with many instruments and do not require ex ante
information about which are the most relevant instruments. Thus, the Lasso-based IV
procedures should provide a valuable complement to existing approaches to estimation and inference
in the presence of many instruments.

7.2. Growth Example. In this section, we consider variable selection in an international 
economic growth example. We use the Barro and Lee (1994) data consisting of a panel of 138 
countries for the period of 1960 to 1985. We consider the national growth rates in GDP per 
capita as the dependent variable. In our analysis, we consider a model with p = 62 covariates 
which allows for a total of n = 90 complete observations. Our goal here is to provide estimates 
which shed light on the convergence hypothesis discussed below by selecting controls from 
among these covariates.

One of the central issues in the empirical growth literature is the estimation of the effect of
an initial (lagged) level of GDP per capita on the growth rates of GDP per capita. In particular,
a key prediction from the classical Solow-Swan-Ramsey growth model is the hypothesis of
convergence, which states that poorer countries should typically grow faster than richer countries
and therefore should tend to catch up with the richer countries over time. This hypothesis
implies that the effect of a country's initial level of GDP on its growth rate should be negative.
As pointed out in Barro and Sala-i-Martin (1995), this hypothesis is rejected using a simple
bivariate regression of growth rates on the initial level of GDP. (In our case, regression yields
a statistically insignificant coefficient of .00132.) In order to reconcile the data and the theory,
the literature has focused on estimating the effect conditional on characteristics of countries.
Covariates that describe such characteristics can include variables measuring education and
science policies, strength of market institutions, trade openness, savings rates and others; see
Barro and Sala-i-Martin (1995). The theory then predicts that the effect of the initial level of
GDP on the growth rate should be negative among otherwise similar countries.

Given that the number of covariates we can condition on is comparable to the sample size,
covariate selection becomes an important issue in this analysis; see Levine and Renelt (1992),
Sala-i-Martin (1997), Sala-i-Martin, Doppelhofer, and Miller (2004). In particular, previous
findings came under severe criticism for relying upon ad hoc procedures for covariate selection;
see, e.g., Levine and Renelt (1992). Since the number of covariates is high, there is no simple
way to resolve the model selection problem using only standard tools. Indeed the number of
possible lower-dimensional models is very large, though see Levine and Renelt (1992),
Sala-i-Martin (1997) and Sala-i-Martin, Doppelhofer, and Miller (2004) for attempts to search over
millions of these models. Here we use $\ell_1$-penalized methods to attempt to resolve this important
issue.

We first present results for covariate selection using the different methods discussed in Section
6: (a) a simple Post-Square-root-Lasso method, which uses the controls selected by applying
Square-root-Lasso to the regression of growth rates on log-GDP and other
controls, and (b) the Post-double-selection method, which uses the controls selected by
Square-root-Lasso in the regression of log-GDP on other controls and in the regression of growth rates
on other controls. These were all based on Square-root Lasso to avoid the estimation of $\sigma$. We
present the model selection results in Table 6.

We can compare our results to those obtained in other standard models in the growth literature such as
(Barro and Sala-i-Martin 1995, Koenker and Machado 1999).

Model Selection Results for the International Growth Regressions
Real GDP per capita (log) is included in all models

Selection Method     Additional Variables Selected
Square-root Lasso    Black Market Premium (log)
Double selection     Terms of trade shock
                     Infant Mortality Rate (0-1 age)
                     Female gross enrollment for secondary education
                     Percentage of "no schooling" in the female population
                     Percentage of "higher school attained" in the male population
                     Average schooling years in the female population over the age of 25

TABLE 6. The controls selected by different methods.

Square-root Lasso applied to the regression of growth rates on log-GDP and other controls 
selected only one control, the log of the black market premium which characterizes trade 
openness. The double selection method selected infant mortality rate, terms of trade shock, 
and several education variables (female gross enrollment for secondary education, percentage 
of "no schooling" in the female population, percentage of "higher school attained" in male 
population, and average schooling years in female population over the age of 25) to forecast 
log-GDP but no additional controls were selected to forecast growth. We refer the reader 
to Barro and Lee (1994) and Barro and Sala-i-Martin (1995) for a complete definition and 
discussion of each of these variables. 

We then proceeded to construct confidence intervals for the coefficient on initial GDP based
on each set of selected variables. We also report estimates of the effect of initial GDP in a model
which uses the set of controls obtained from the double-selection procedure and additionally
includes the log of the black market premium. We expressly allow for such an amelioration
strategy in our formal construction of the estimator. Table 7 shows these results. We find that
in all these models the linear regression coefficients on the initial level of GDP are negative.
In addition, zero is excluded from the 90% confidence interval in each case. These findings
support the hypothesis of (conditional) convergence derived from the classical
Solow-Swan-Ramsey growth model. The findings also agree with and thus support the previous findings
reported in Barro and Sala-i-Martin (1995), which relied on ad hoc reasoning for covariate
selection.
Confidence Intervals after Model Selection
for the International Growth Regressions

                                                  Real GDP per capita (log)
Method                                            Coefficient   90% Confidence Interval
Post Square-root Lasso                            -0.0112       [-0.0219, -0.0007]
Post Double selection                             -0.0221       [-0.0437, -0.0005]
Post Double selection (+ Black Market Premium)    -0.0302       [-0.0509, -0.0096]

TABLE 7. The table above displays the coefficient and a 90% confidence interval associated
with each method. The selected models are displayed in Table 6.



8. Conclusion 

There are many situations in economics where a researcher has access to data with a large
number of covariates. In this article, we have presented results for performing analysis of such
data by selecting relevant regressors and estimating their coefficients using $\ell_1$-penalization
methods. We gave special attention to the instrumental variables model and the partially
linear model, both of which are widely used to estimate structural economic effects. Through
simulation and empirical examples, we have demonstrated that $\ell_1$-penalization methods may
be usefully employed in these models and can complement tools commonly employed by applied
researchers.

Of course, there are many avenues for additional research. The use of $\ell_1$-penalization is
only one method of performing estimation with high-dimensional data. It will be interesting
to consider and understand the behavior of other methods (e.g. Huang, Horowitz, and Ma
(2008), Fan and Li (2001), Zhang (2010), Fan and Liao (2011)) for estimating structural
economic objects. In addition, extending HDS models and methods to other types of economic
models beyond those considered in this article will be interesting. An important problem
in economics is the analysis of high-dimensional data in which there are many weak signals
within the set of variables considered, in which case the sparsity assumption may provide a
poor approximation. The sup-score test presented in this article offers one approach to dealing
with this problem, but further research dealing with this issue seems warranted. It
would also be interesting to consider efficient use of high-dimensional data in cases in which
scores are not independent across observations, which is a much-considered case in economics.
Overall, we believe the results in this article provide useful tools for applied economists but
that there are still substantial and interesting topics in the use of high-dimensional economic
data that warrant further investigation.

Appendix A. Iterated Estimation of the Noise Level $\sigma$

In the case of Lasso, the penalty levels (3.9) and (3.10) require the practitioner to fill in a value for
$\sigma$. Theoretically, any upper bound on $\sigma$ can be used and the standard approach in the literature is
to use the conservative estimate $\bar\sigma = \sqrt{\mathrm{Var}_n[y_i]} := \sqrt{\mathbb{E}_n[(y_i - \bar y)^2]}$, where $\bar y = \mathbb{E}_n[y_i]$. Unfortunately,
in various examples we found that this approach leads to overpenalization. Here we briefly discuss
iterative procedures to estimate $\sigma$ similar to the ones described in Belloni and Chernozhukov (2011b).
Let $I_0$ be a set of regressors that is included in the model. Note that $I_0$ is always non-empty since it
will always include the intercept. Let $\hat\beta(I_0)$ be the least squares estimator of the coefficients on the
covariates associated with $I_0$, and define $\hat\sigma_{I_0} := \sqrt{\mathbb{E}_n[(y_i - x_i'\hat\beta(I_0))^2]}$.

An algorithm for estimating $\sigma$ using Lasso is as follows:

Algorithm 1 (Estimation of $\sigma$ using Lasso iterations). For a positive number $\psi$, set $\hat\sigma_0 = \psi\hat\sigma_{I_0}$. Set
$k = 0$, and specify a small constant $\nu > 0$ as a tolerance level and a constant $K > 1$ as an upper bound
on the number of iterations. (1) Compute the Lasso estimator $\hat\beta$ based on $\lambda = 2c\hat\sigma_k\Lambda(1 - \gamma|X)$. (2) Set
$\hat\sigma_{k+1} = \sqrt{\hat Q(\hat\beta)}$. (3) If $|\hat\sigma_{k+1} - \hat\sigma_k| \leq \nu$ or $k > K$, report $\hat\sigma = \hat\sigma_{k+1}$; otherwise set $k \leftarrow k + 1$ and go to (1).

Similarly, an algorithm for estimating $\sigma$ using Post-Lasso is as follows:

Algorithm 2 (Estimation of $\sigma$ using Post-Lasso iterations). For a positive number $\psi$, set $\hat\sigma_0 = \psi\hat\sigma_{I_0}$.
Set $k = 0$, and specify a small constant $\nu > 0$ as a tolerance level and a constant $K > 1$ as an upper bound
on the number of iterations. (1) Compute the Post-Lasso estimator $\tilde\beta$ based on $\lambda = 2c\hat\sigma_k\Lambda(1 - \gamma|X)$.
(2) For $\hat s = \|\tilde\beta\|_0 = |\hat T|$ set $\hat\sigma_{k+1}^2 = \hat Q(\tilde\beta) \cdot n/(n - \hat s)$. (3) If $|\hat\sigma_{k+1} - \hat\sigma_k| \leq \nu$ or $k > K$, report $\hat\sigma = \hat\sigma_{k+1}$;
otherwise, set $k \leftarrow k + 1$ and go to (1).
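Algorithm 2 can be sketched as follows. This is an illustration under several assumptions that are not stated in this appendix: the Lasso criterion is taken to be $\mathbb{E}_n[(y_i - x_i'\beta)^2] + (\lambda/n)\|\beta\|_1$; $\Lambda(1-\gamma|X)$ is approximated by simulating the $(1-\gamma)$-quantile of $n\|\mathbb{E}_n[x_i g_i]\|_\infty$ with $g_i$ standard normal; $I_0$ contains only the intercept, handled by assuming `y` and the columns of `X` are demeaned; and the names `lasso_cd`, `simulate_Lambda` and `sigma_post_lasso` are hypothetical.

```python
import numpy as np


def lasso_cd(X, y, lam, n_sweeps=500, tol=1e-10):
    """Coordinate descent for E_n[(y_i - x_i'b)^2] + (lam / n) * ||b||_1 (assumed criterion)."""
    n, p = X.shape
    b = np.zeros(p)
    r = np.asarray(y, dtype=float).copy()   # current residual y - X b
    ss = (X ** 2).mean(axis=0)
    for _ in range(n_sweeps):
        biggest = 0.0
        for j in range(p):
            if ss[j] == 0.0:
                continue
            rho = X[:, j] @ r / n + ss[j] * b[j]
            new = np.sign(rho) * max(abs(rho) - lam / (2.0 * n), 0.0) / ss[j]
            if new != b[j]:
                r -= X[:, j] * (new - b[j])
                biggest = max(biggest, abs(new - b[j]))
                b[j] = new
        if biggest < tol:
            break
    return b


def simulate_Lambda(X, gamma=0.05, reps=500, rng=None):
    """Simulated (1 - gamma)-quantile of n * ||E_n[x_i g_i]||_inf, g_i ~ N(0, 1) (assumed definition)."""
    rng = np.random.default_rng(0) if rng is None else rng
    G = rng.standard_normal((reps, X.shape[0]))
    return float(np.quantile(np.abs(G @ X).max(axis=1), 1.0 - gamma))


def sigma_post_lasso(X, y, Lambda, c=1.1, psi=0.1, nu=1e-4, K=15):
    """Iterate sigma_k -> Post-Lasso fit -> sigma_{k+1} until the change is at most nu or k > K."""
    n = X.shape[0]
    sigma_k = psi * np.sqrt(np.mean((y - y.mean()) ** 2))   # sigma_0 = psi * sigma_{I0}
    k = 0
    while True:
        lam = 2.0 * c * sigma_k * Lambda
        T = np.flatnonzero(lasso_cd(X, y, lam))              # selected support
        s_hat = T.size
        if s_hat > 0:
            b, *_ = np.linalg.lstsq(X[:, T], y, rcond=None)  # Post-Lasso refit
            resid = y - X[:, T] @ b
        else:
            resid = y
        # degrees-of-freedom corrected noise estimate (guarding against n <= s_hat)
        sigma_next = np.sqrt(np.mean(resid ** 2) * n / max(n - s_hat, 1))
        if abs(sigma_next - sigma_k) <= nu or k > K:
            return float(sigma_next)
        sigma_k, k = sigma_next, k + 1
```

Algorithm 1 differs only in step (2): it uses the Lasso residuals directly, without the refit and without the degrees-of-freedom factor.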

Comment A.1. We note that we employ the standard degree-of-freedom correction with $\hat s = \|\tilde\beta\|_0 =
|\hat T|$ when using Post-Lasso (Algorithm 2). No additional correction is necessary when using Lasso
(Algorithm 1) since the Lasso estimate is already sufficiently regularized. We note that the sequence
$\hat\sigma_k$, $k \geq 2$, produced by Algorithm 1 is monotone and that the estimates $\hat\sigma_k$, $k \geq 1$, produced by
Algorithm 2 can only assume a finite number of different values. Belloni and Chernozhukov (2011b)
and Belloni and Chernozhukov (2011c) provide theoretical analysis for $\psi = 1$. In preliminary simulations
with coefficients that were not well separated from zero, we found that $\psi = 0.1$ worked better than
$\psi = 1$ by avoiding unnecessary overpenalization in the first iteration. □



Appendix B. Proof of Theorem 3



Step 1. Recall that $A_i = (f(z_i), w_i')'$ and $d_i = (y_{2i}, w_i')'$ for $i = 1,\dots,n$. Let $X = [x_1,\dots,x_n]'$,
$A = [A_1,\dots,A_n]'$, $D = [d_1,\dots,d_n]'$, $W = [w_1,\dots,w_n]'$, $f = [f(z_1),\dots,f(z_n)]'$, $Y_2 = [y_{21},\dots,y_{2n}]'$, $V =
[v_1,\dots,v_n]'$, and $\zeta = [\zeta_1,\dots,\zeta_n]'$. We have that

$$\sqrt{n}(\hat\alpha^* - \alpha_0) = [\hat A'D/n]^{-1}\hat A'\zeta/\sqrt{n} = [Q_n + o_P(1)]^{-1}(A'\zeta/\sqrt{n} + o_P(1)),$$

where by Steps 3 and 4 below:

$$\hat A'D/n = A'D/n + o_P(1) = Q_n + o_P(1), \qquad (B.55)$$

$$\hat A'\zeta/\sqrt{n} = A'\zeta/\sqrt{n} + o_P(1). \qquad (B.56)$$

Moreover, by the assumption on $\sigma_\zeta$ and $Q_n$, $\mathrm{Var}(A'\zeta/\sqrt{n}) = \sigma_\zeta^2 Q_n$ has eigenvalues bounded away from
zero and bounded from above, uniformly in $n$. Therefore, $\sqrt{n}(\hat\alpha^* - \alpha_0) = Q_n^{-1}A'\zeta/\sqrt{n} + o_P(1)$, and
$Q_n^{-1}A'\zeta/\sqrt{n}$ is a vector distributed as normal with mean zero and covariance $\sigma_\zeta^2 Q_n^{-1}$. This verifies the
main claim of the theorem.

Step 2. This is an auxiliary step where we note that conditions of the theorem imply by Markov
inequality:

$$f'f/n + \mathrm{tr}(W'W/n) = \mathrm{tr}(A'A/n) = \mathrm{tr}(Q_n) \lesssim 1,$$
\\D'C/n\\ ^ \V'C/n\ + \\A'C/n\\ < P a Cv + 
$$\|A'V/n\|^2 = |f'V/n|^2 + \|W'V/n\|^2 \lesssim_P 1/n,$$
$$\|D/\sqrt{n}\| \le \|V/\sqrt{n}\| + \|A/\sqrt{n}\| \lesssim_P 1.$$

Step 3. To show (B.55), note that $\hat A - A = (\hat f' - f', 0')'$. Thus,

$$\|\hat A'D/n - A'D/n\| = |(\hat f - f)'Y_2/n| \le \sqrt{(\hat f - f)'(\hat f - f)/n}\,\sqrt{Y_2'Y_2/n} = o_P(1)$$

since $\sqrt{Y_2'Y_2/n} \lesssim_P 1$ by Markov inequality, and $\sqrt{(\hat f - f)'(\hat f - f)/n} = o_P(1)$ by Theorems 1 or 2. Next,
since $f'V/n = o_P(1)$ and $W'V/n = o_P(1)$ by Step 2, note that $A'D/n = A'A/n + o_P(1) = Q_n + o_P(1)$.

Step 4. To show (B.56), note that

$$\|(\hat A - A)'\zeta/\sqrt{n}\| = |(\hat f - f)'\zeta/\sqrt{n}| = |(X(\hat\beta - \beta_0))'\zeta/\sqrt{n} - (f - X\beta_0)'\zeta/\sqrt{n}|$$

$$\le \|X'\zeta/\sqrt{n}\|_\infty\,\|\hat\beta - \beta_0\|_1 + |(f - X\beta_0)'\zeta/\sqrt{n}| \to_P 0.$$

This follows because the first term is of order $\sqrt{\log(p \vee n)}\,\sqrt{s^2\log(p \vee n)/n} \to 0$ by conditions of
the theorem; the order follows because $\|X'\zeta/\sqrt{n}\|_\infty \lesssim_P \sqrt{\log(p \vee n)}$ by (3.11), and $\|\hat\beta - \beta_0\|_1 \lesssim_P
\sqrt{[s^2\log(p \vee n)]/n}$ by Theorems 1 and 2 since $\|\hat\beta - \beta_0\|_1 \le \sqrt{\hat s + s}\,\|\hat\beta - \beta_0\| \lesssim_P \sqrt{[s^2\log(p \vee n)]/n}$
under condition SE and $\hat s \lesssim_P s$. On the other hand, the second term converges to zero in probability
by Markov inequality, because the expectation of $|(f - X\beta_0)'\zeta/\sqrt{n}|^2$ is of order $\sigma_\zeta^2 c_s^2 \to 0$.



Step 5. This step establishes consistency of the variance estimator. Since $\sigma_\zeta^2$ and the eigenvalues of
$Q_n$ are bounded away from zero and from above uniformly in $n$, it suffices to show $\hat\sigma_\zeta^2 - \sigma_\zeta^2 \to_P 0$ and
$\hat A'\hat A/n - Q_n \to_P 0$. Indeed, $\hat\sigma_\zeta^2 = \|\zeta - D(\hat\alpha^* - \alpha_0)\|^2/n = \|\zeta\|^2/n + 2\zeta'D(\alpha_0 - \hat\alpha^*)/n + \|D(\alpha_0 - \hat\alpha^*)\|^2/n$
so that $\|\zeta\|^2/n - \sigma_\zeta^2 \to_P 0$ by Chebyshev inequality since $\max_i E[\zeta_i^4]$ is bounded uniformly in $n$, and the
remaining terms converge to zero in probability since $\hat\alpha^* - \alpha_0 \to_P 0$, $\|D'\zeta/n\| \lesssim_P 1$ by Step 2. Next,
note that

$$\|\hat A'\hat A/n - A'A/n\| = \|A'(\hat A - A)/n + (\hat A - A)'A/n + (\hat A - A)'(\hat A - A)/n\|$$

which is bounded up to a constant by $(\|\hat A - A\|/\sqrt{n})(\|A\|/\sqrt{n}) + \|\hat A - A\|^2/n \to_P 0$ since $\|\hat A - A\|^2/n =
\|\hat f - f\|^2/n = o_P(1)$ by Theorems 1 or 2 and $\|A\|^2/n \lesssim_P 1$ holding by Step 2. □
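
For intuition, the estimator and the plug-in variance from Steps 1 and 5 can be written out in the simplest case. The sketch below assumes a single endogenous regressor and no controls $w_i$, so that $A_i = f(z_i)$ and $d_i = y_{2i}$, and it takes the fitted instrument `f_hat` as given (for example from a Lasso or Post-Lasso first stage computed elsewhere). The function name and the returned standard-error convention are choices made for this illustration.

```python
import math


def iv_with_estimated_instrument(y1, y2, f_hat):
    """Scalar IV estimate [A_hat' D]^{-1} A_hat' Y_1 and a plug-in standard error."""
    n = len(y1)
    alpha = (sum(f * a for f, a in zip(f_hat, y1))
             / sum(f * d for f, d in zip(f_hat, y2)))
    # sigma_zeta^2 estimate: ||Y_1 - D alpha||^2 / n.
    sigma2 = sum((a - d * alpha) ** 2 for a, d in zip(y1, y2)) / n
    # Q_n estimate: A_hat' A_hat / n.
    q_hat = sum(f * f for f in f_hat) / n
    # sqrt(n) (alpha - alpha_0) has approximate variance sigma2 / q_hat.
    se = math.sqrt(sigma2 / q_hat / n)
    return alpha, se


# Usage: a tiny hand-made sample with alpha_0 = 2.
f_hat = [1.0, 2.0, 3.0, 4.0]
y2 = [1.5, 1.5, 3.5, 3.5]          # f plus first-stage noise
y1 = [4.0, 2.0, 6.0, 8.0]          # 2 * y2 plus structural noise
alpha_hat, se_hat = iv_with_estimated_instrument(y1, y2, f_hat)
```

With controls, the same formulas apply in matrix form, with $\hat A'\hat A/n$ replacing the scalar `q_hat`.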

Appendix C. Proof of Theorem 4
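Step 1. When $a = \alpha_1$ we have that

$$\Lambda_{\alpha_1} = \max_{1 \le j \le p}\frac{n|\mathbb{E}_n[\tilde\epsilon_i\tilde x_{ij}]|}{\sqrt{\mathbb{E}_n[\tilde\epsilon_i^2\tilde x_{ij}^2]}} = \max_{1 \le j \le p}\frac{n|\mathbb{E}_n[\tilde g_i\tilde x_{ij}]|}{\sqrt{\mathbb{E}_n[\tilde g_i^2\tilde x_{ij}^2]}},$$

so claim (1) follows from the definition of quantile and from the continuity of the distribution of $\Lambda_{\alpha_1}$.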
Step 2. To establish claim (2), we note that

$$n\mathbb{E}_n[\tilde g_i\tilde x_{ij}] = n\mathbb{E}_n[g_i\tilde x_{ij}] = \sqrt{n}\,\mathcal{N}_j\sqrt{\mathbb{E}_n[\tilde x_{ij}^2]} = \sqrt{n}\,\mathcal{N}_j,$$

where $\mathcal{N}_j \sim N(0,1)$ for each $j$. Since for $\hat\mu_g = (\mathbb{E}_n[w_iw_i'])^{-1}\mathbb{E}_n[w_ig_i]$ we have $\|\hat\mu_g\| \lesssim_P 1/\sqrt{n}$ by the
assumed boundedness of $\|(\mathbb{E}_n[w_iw_i'])^{-1}\|$ and boundedness of $\|w_i\|$, we conclude that $\max_{i \le n}|w_i'\hat\mu_g| \lesssim_P
1/\sqrt{n}$, so that

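uniformly in $j \in \{1,\dots,p\}$, using the triangle inequality and the decomposition $\tilde g_i = g_i - w_i'\hat\mu_g$.

Moreover, using the Bernstein-type inequality in Lemma 5.15 of van de Geer (2000), we can conclude
that

$$|\mathbb{E}_n[g_i^2\tilde x_{ij}^2] - \mathbb{E}_n[\tilde x_{ij}^2]| \lesssim_P \sqrt{(\log p)/n},$$

uniformly in $j \in \{1,\dots,p\}$. Hence since $\mathbb{E}_n[\tilde x_{ij}^2] = 1$ by the normalization assumption, we conclude that
with probability approaching 1,

$$\Lambda_{\alpha_1} \le \max_{1 \le j \le p} c\,n|\mathbb{E}_n[\tilde g_i\tilde x_{ij}]|/\sqrt{\mathbb{E}_n[\tilde x_{ij}^2]} = \max_{1 \le j \le p} c\sqrt{n}\,|\mathcal{N}_j|,$$

and the claim (2) follows by the union bound and standard tail properties of $N(0,1)$.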

Step 3. To show claim (3) we note that using triangular and other elementary inequalities: 

n|E„[(ei - (a - ai)y 2 i)xij] 



An = 



max 



5* max 



^jE n [(ii - (a - ai)y 2 i) 2 ^-] 
\a - ai\n\E n [y 2 iXij]\ 



r,2 z-2 1 



The first term is bounded below by, with probability approaching 1, 

_i \a - cti\\nE n [y 2 iXij]\ 



Ar 



by Step 2 for some $c > 1$, and $\Lambda_{\alpha_1} \lesssim_P \sqrt{n\log p}$ by Step 2. Hence for any constant $C$, with probability
converging to 1, $\Lambda_a - C\sqrt{n\log p} \to +\infty$, so that Claim (3) immediately follows, because by Step 2
$\Lambda(1-\gamma|X,W) \le \Lambda(1-\gamma) \lesssim \sqrt{n\log p}$, as $\gamma \in (0,1)$ is fixed by assumption. □
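
The statistic analyzed in this proof can be illustrated with a short sketch. It rests on assumptions that are not spelled out in this appendix: there are no controls $w_i$ (so no partialling out is needed and the tildes can be dropped), the statistic has the self-normalized form $\Lambda_a = \max_j n|\mathbb{E}_n[(y_{1i} - a y_{2i})x_{ij}]| / \sqrt{\mathbb{E}_n[(y_{1i} - a y_{2i})^2 x_{ij}^2]}$, and the critical value is taken to be $c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2p))$, which is what the union bound in Step 2 suggests. The function names and the defaults `gamma=0.05`, `c=1.1` are hypothetical.

```python
import math
from statistics import NormalDist


def sup_score(y1, y2, X, a):
    """Self-normalized sup-score statistic at the hypothesized value a."""
    n, p = len(y1), len(X[0])
    e = [y1[i] - a * y2[i] for i in range(n)]
    stats = []
    for j in range(p):
        num = abs(sum(e[i] * X[i][j] for i in range(n)))          # n |E_n[e_i x_ij]|
        den = math.sqrt(sum((e[i] * X[i][j]) ** 2 for i in range(n)) / n)
        stats.append(num / den if den > 0.0 else 0.0)
    return max(stats)


def critical_value(n, p, gamma=0.05, c=1.1):
    """Assumed bound c * sqrt(n) * Phi^{-1}(1 - gamma / (2p))."""
    return c * math.sqrt(n) * NormalDist().inv_cdf(1.0 - gamma / (2.0 * p))


def confidence_region(y1, y2, X, grid, gamma=0.05, c=1.1):
    """Grid points a that are not rejected by the sup-score test."""
    cv = critical_value(len(y1), len(X[0]), gamma, c)
    return [a for a in grid if sup_score(y1, y2, X, a) <= cv]
```

Inverting the test over a grid of candidate values gives a confidence region that does not rely on the strength of the instruments; with controls present, `y1`, `y2` and the columns of `X` would first be replaced by their residuals from a regression on $w_i$.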



Appendix D. Proof of Theorem 5

We first introduce some notation. We use standard matrix notation, namely $Y_1 = [y_{11},\dots,y_{1n}]'$,
$X = [x_1,\dots,x_n]'$, $D = [d_1,\dots,d_n]'$, $V = [v_1,\dots,v_n]'$, $\zeta = [\zeta_1,\dots,\zeta_n]'$, $m = [m_1,\dots,m_n]'$ for $m_i = m(z_i)$,
$R_m = [r_{m1},\dots,r_{mn}]'$, $g = [g_1,\dots,g_n]'$ for $g_i = g(z_i)$, $R_g = [r_{g1},\dots,r_{gn}]'$, and so on. Let $\phi_{\min}(s) =
\phi_{\min}(s)[\mathbb{E}_n[x_ix_i']]$. For $A \subset \{1,\dots,p\}$, let $X[A] = \{X_j, j \in A\}$, where $\{X_j, j = 1,\dots,p\}$ are the columns
of $X$. Let

$$\mathcal{P}_A = X[A](X[A]'X[A])^{-}X[A]'$$

be the projection operator sending vectors in $\mathbb{R}^n$ onto $\mathrm{span}[X[A]]$, and let $\mathcal{M}_A = I_n - \mathcal{P}_A$ be the
projection onto the subspace that is orthogonal to $\mathrm{span}[X[A]]$. For a vector $Z \in \mathbb{R}^n$, let

$$\tilde\beta_Z(A) := \arg\min_{b \in \mathbb{R}^p} \|Z - Xb\|^2 : b_j = 0, \ \forall j \notin A,$$

be the coefficient of linear projection of $Z$ onto $\mathrm{span}[X[A]]$. If $A = \emptyset$, interpret $\mathcal{P}_A = 0_n$, and $\tilde\beta_Z = 0_p$.

Step 1. (Main) Write $\check\alpha = [D'\mathcal{M}_{\hat I}D/n]^{-1}[D'\mathcal{M}_{\hat I}Y_1/n]$ so that

$$\sqrt{n}(\check\alpha - \alpha_0) = [D'\mathcal{M}_{\hat I}D/n]^{-1}[D'\mathcal{M}_{\hat I}(g + \zeta)/\sqrt{n}] =: ii^{-1}\cdot i.$$

By Steps 2 and 3, $ii = V'V/n + o_P(1)$ and $i = V'\zeta/\sqrt{n} + o_P(1)$. Since $V'V/n = \sigma_v^2 + o_P(1)$ by Chebyshev
inequality, and $\sigma_v^2$ and $\sigma_\zeta^2$ are bounded from above and away from zero by assumption, and

$$V'\zeta/\sqrt{n} = \sigma_\zeta\sqrt{V'V/n}\;N(0,1),$$

conclude that

$$\sigma_\zeta^{-1}(V'V/n)^{1/2}\sqrt{n}(\check\alpha - \alpha_0) = N(0,1) + o_P(1).$$
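
To make the partialling-out formula in Step 1 concrete, here is a small sketch that applies $\mathcal{M}_A$ to a vector without forming any matrix inverse. It uses Gram-Schmidt orthogonalization and skips numerically dependent columns, which plays the role of the generalized inverse in the definition of $\mathcal{P}_A$; that substitution, the tolerance `tol`, and the function names are choices made for this illustration, with the selected set $\hat I$ passed in as a list of column indices.

```python
def annihilate(X, A, Z, tol=1e-12):
    """Return M_A Z = Z - P_A Z, with P_A the projection onto span{X_j : j in A}."""
    n = len(Z)
    basis = []                                     # orthogonalized columns of X[A]
    for j in sorted(A):
        col = [X[i][j] for i in range(n)]
        for q in basis:
            coef = sum(a * b for a, b in zip(col, q)) / sum(b * b for b in q)
            col = [a - coef * b for a, b in zip(col, q)]
        if sum(a * a for a in col) > tol:          # drop linearly dependent columns
            basis.append(col)
    out = list(Z)
    for q in basis:
        coef = sum(a * b for a, b in zip(out, q)) / sum(b * b for b in q)
        out = [a - coef * b for a, b in zip(out, q)]
    return out


def double_selection_alpha(X, I_hat, D, Y1):
    """[D' M D]^{-1} [D' M Y_1], using that M is symmetric and idempotent."""
    md = annihilate(X, I_hat, D)
    my = annihilate(X, I_hat, Y1)
    return sum(a * b for a, b in zip(md, my)) / sum(a * a for a in md)


# Usage: Y1 = 2 * D + 3 * X_0 + 1 * X_1 exactly, and both controls are selected.
X = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
D = [1.0, 2.0, 4.0, 3.0]
Y1 = [6.0, 6.0, 12.0, 8.0]
alpha_check = double_selection_alpha(X, [0, 1], D, Y1)
```

With an empty index set the function returns $Z$ unchanged, matching the convention $\mathcal{P}_\emptyset = 0_n$.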

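Step 2. (Behavior of $i$.) Decompose

$$i = V'\zeta/\sqrt{n} + \underbrace{m'\mathcal{M}_{\hat I}g/\sqrt{n}}_{=:i_a} + \underbrace{m'\mathcal{M}_{\hat I}\zeta/\sqrt{n}}_{=:i_b} + \underbrace{V'\mathcal{M}_{\hat I}g/\sqrt{n}}_{=:i_c} - \underbrace{V'\mathcal{P}_{\hat I}\zeta/\sqrt{n}}_{=:i_d}. \qquad (D.57)$$

First, note that by Steps 4 and 5 and by the growth condition $s^2\log^2(p \vee n) = o(n)$

$$|i_a| \le \sqrt{n}\,\|m'\mathcal{M}_{\hat I}/\sqrt{n}\|\,\|g'\mathcal{M}_{\hat I}/\sqrt{n}\| \lesssim_P \sqrt{n}\sqrt{[s\log(p \vee n)]^2/n^2} = o_P(1).$$

Second, using decomposition $m = X\beta_{m0} + R_m$, bound

$$|i_b| \le |R_m'\zeta/\sqrt{n}| + |(\tilde\beta_m(\hat I) - \beta_{m0})'X'\zeta/\sqrt{n}| \lesssim_P \sqrt{[s\log(p \vee n)]^2/n} = o_P(1),$$

where $|R_m'\zeta/\sqrt{n}| \lesssim_P \sqrt{R_m'R_m/n} \lesssim \sqrt{s/n}$ by Chebyshev inequality and by assumption ASTE, and

$$|(\tilde\beta_m(\hat I) - \beta_{m0})'X'\zeta/\sqrt{n}| \le \|\tilde\beta_m(\hat I) - \beta_{m0}\|_1\,\|X'\zeta/\sqrt{n}\|_\infty \lesssim_P \sqrt{[s^2\log(p \vee n)]/n}\,\sqrt{\log(p \vee n)},$$

since $\|\tilde\beta_m(\hat I) - \beta_{m0}\|_1 \le \sqrt{\hat s}\,\|\tilde\beta_m(\hat I) - \beta_{m0}\| \lesssim_P \sqrt{[s^2\log(p \vee n)]/n}$ by Step 4, using that $\hat s \lesssim_P s$ by Theorem 2, and
$\|X'\zeta/\sqrt{n}\|_\infty \lesssim_P \sqrt{\log(p \vee n)}$ by the Gaussian maximal inequality (3.11) and normalization condition
on $X$. Third, using similar reasoning, decomposition $g = X\beta_{g0} + R_g$, and Step 5, conclude

$$|i_c| \le |R_g'V/\sqrt{n}| + |(\tilde\beta_g(\hat I) - \beta_{g0})'X'V/\sqrt{n}| \lesssim_P \sqrt{[s\log(p \vee n)]^2/n} = o_P(1).$$

Fourth, using that $\hat s \lesssim_P s$ by Theorem 5 so that $1/\phi_{\min}(\hat s) \lesssim_P 1$ by condition SE, conclude

$$|i_d| \le |\tilde\beta_V(\hat I)'X'\zeta/\sqrt{n}| \le \|\tilde\beta_V(\hat I)\|_1\,\|X'\zeta/\sqrt{n}\|_\infty \lesssim_P \sqrt{n}\sqrt{[s\log(p \vee n)]^2/n^2} = o_P(1),$$

since $\|\tilde\beta_V(\hat I)\|_1 \le \sqrt{\hat s}\,\|\tilde\beta_V(\hat I)\| \le \sqrt{\hat s}\,\|(X[\hat I]'X[\hat I]/n)^{-1}X[\hat I]'V/n\| \le \sqrt{\hat s}\,\phi_{\min}^{-1}(\hat s)\sqrt{\hat s}\,\|X'V/\sqrt{n}\|_\infty/\sqrt{n} \lesssim_P \sqrt{[s^2\log(p \vee n)]/n}$.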

Step 3. (Behavior of $ii$.) Decompose

$$ii = (m + V)'\mathcal{M}_{\hat I}(m + V)/n = V'V/n + \underbrace{m'\mathcal{M}_{\hat I}m/n}_{=:ii_a} + \underbrace{2m'\mathcal{M}_{\hat I}V/n}_{=:ii_b} - \underbrace{V'\mathcal{P}_{\hat I}V/n}_{=:ii_c}.$$

Then $|ii_a| \lesssim_P [s\log(p \vee n)]/n = o_P(1)$ by Step 4, $|ii_b| \lesssim_P [s\log(p \vee n)]/n = o_P(1)$ by reasoning similar
to deriving the bound for $|i_b|$, and $|ii_c| \lesssim_P [s\log(p \vee n)]/n = o_P(1)$ by reasoning similar to deriving the
bound for $|i_d|$.

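Step 4. (Auxiliary: Bound on $\|\mathcal{M}_{\hat I}m\|$ and related quantities.) Observe that

$$\sqrt{[s\log(p \vee n)]/n} \underset{(1)}{\gtrsim_P} \|\mathcal{M}_{\hat I_1}m/\sqrt{n}\| \underset{(2)}{\gtrsim_P} \|\mathcal{M}_{\hat I}m/\sqrt{n}\| \underset{(3)}{\gtrsim_P} \big|\|X(\tilde\beta_m(\hat I) - \beta_{m0})/\sqrt{n}\| - \|R_m/\sqrt{n}\|\big|$$

where inequality (1) holds since by Theorem 2 $\|\mathcal{M}_{\hat I_1}m/\sqrt{n}\| \le \|(X\tilde\beta_D(\hat I_1) - m)/\sqrt{n}\| \lesssim_P \sqrt{[s\log(p \vee n)]/n}$,
(2) holds by $\hat I_1 \subseteq \hat I$, and (3) by the triangle inequality. Since $\|R_m/\sqrt{n}\| \lesssim \sqrt{s/n}$ by assumption ASTE,
conclude that wp $\to 1$,

$$\sqrt{[s\log(p \vee n)]/n} \gtrsim_P \|X(\tilde\beta_m(\hat I) - \beta_{m0})/\sqrt{n}\| \ge \sqrt{\phi_{\min}(\hat s)}\,\|\tilde\beta_m(\hat I) - \beta_{m0}\| \gtrsim_P \|\tilde\beta_m(\hat I) - \beta_{m0}\|$$

since $\hat s \lesssim_P s$ by Theorem 5 so that $1/\phi_{\min}(\hat s) \lesssim_P 1$ by condition SE.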

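Step 5. (Auxiliary: Bound on $\|\mathcal{M}_{\hat I}g\|$ and related quantities.) Observe that

$$\sqrt{[s\log(p \vee n)]/n} \underset{(1)}{\gtrsim_P} \|\mathcal{M}_{\hat I_2}(\alpha_0 m + g)/\sqrt{n}\| \underset{(2)}{\gtrsim_P} \|\mathcal{M}_{\hat I}(\alpha_0 m + g)/\sqrt{n}\| \underset{(3)}{\gtrsim_P} \big|\|\mathcal{M}_{\hat I}g/\sqrt{n}\| - \|\mathcal{M}_{\hat I}\alpha_0 m/\sqrt{n}\|\big|$$

where inequality (1) holds since by Theorem 2 $\|\mathcal{M}_{\hat I_2}(\alpha_0 m + g)/\sqrt{n}\| \le \|(X\tilde\beta_{Y_1}(\hat I_2) - \alpha_0 m - g)/\sqrt{n}\| \lesssim_P
\sqrt{[s\log(p \vee n)]/n}$, (2) holds by $\hat I_2 \subseteq \hat I$, and (3) by the triangle inequality. Since $\|\alpha_0\|$ is bounded
uniformly in $n$ by assumption, by Step 4, $\|\mathcal{M}_{\hat I}\alpha_0 m/\sqrt{n}\| \lesssim_P \sqrt{[s\log(p \vee n)]/n}$. Hence conclude that

$$\sqrt{[s\log(p \vee n)]/n} \gtrsim_P \|\mathcal{M}_{\hat I}g/\sqrt{n}\| \ge \big|\|X(\tilde\beta_g(\hat I) - \beta_{g0})/\sqrt{n}\| - \|R_g/\sqrt{n}\|\big|$$

where $\|R_g/\sqrt{n}\| \lesssim \sqrt{s/n}$ by condition ASTE. Then conclude similarly to Step 4 that wp $\to 1$,

$$\sqrt{[s\log(p \vee n)]/n} \gtrsim_P \|X(\tilde\beta_g(\hat I) - \beta_{g0})/\sqrt{n}\| \ge \sqrt{\phi_{\min}(\hat s)}\,\|\tilde\beta_g(\hat I) - \beta_{g0}\| \gtrsim_P \|\tilde\beta_g(\hat I) - \beta_{g0}\|.$$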

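Step 6. (Variance Estimation.) Since $\hat s \lesssim_P s = o(n)$, $(n - \hat s - 1)/n = 1 + o_P(1)$. Hence consider

$$\hat\sigma_\zeta^2 = \|(Y_1 - \check\alpha D)'\mathcal{M}_{\hat I}\|^2/n = \|(\zeta + (\alpha_0 - \check\alpha)D + g)'\mathcal{M}_{\hat I}\|^2/n.$$

Then by Steps 1, 3, and 5

$$\big|\hat\sigma_\zeta - \|\zeta'\mathcal{M}_{\hat I}\|/\sqrt{n}\big| \le \|g'\mathcal{M}_{\hat I}\|/\sqrt{n} + |\check\alpha - \alpha_0|\,\|D'\mathcal{M}_{\hat I}\|/\sqrt{n} \lesssim_P \sqrt{[s\log(p \vee n)]/n} + n^{-1/2} = o_P(1).$$

Moreover,

$$\|\zeta'\mathcal{M}_{\hat I}\|^2/n = \zeta'\zeta/n - \zeta'\mathcal{P}_{\hat I}\zeta/n = \sigma_\zeta^2 + o_P(1),$$

where $\zeta'\zeta/n = \sigma_\zeta^2 + O_P(n^{-1/2})$ by Chebyshev inequality and $\zeta'\mathcal{P}_{\hat I}\zeta/n \lesssim_P [s\log(p \vee n)]/n = o_P(1)$ by
the argument similar to that used to bound $|i_d|$. □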
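
The variance estimate discussed in Step 6 can be turned into a standard error in a couple of lines. The sketch below assumes that the vectors `my` and `md`, standing for $\mathcal{M}_{\hat I}Y_1$ and $\mathcal{M}_{\hat I}D$, have already been computed, that `alpha` is the resulting estimate $\check\alpha$, and that $D'\mathcal{M}_{\hat I}D/n$ is used as the plug-in for $V'V/n$, which Step 3 suggests is asymptotically harmless. The function name is hypothetical, and the divisor $n - \hat s - 1$ is the degrees-of-freedom version mentioned at the start of the step.

```python
import math


def variance_and_se(my, md, alpha, s_hat):
    """Residual variance with divisor n - s_hat - 1 and the implied standard error."""
    n = len(my)
    resid = [a - alpha * b for a, b in zip(my, md)]        # M (Y_1 - alpha D)
    sigma2 = sum(r * r for r in resid) / (n - s_hat - 1)
    se = math.sqrt(sigma2 / sum(b * b for b in md))        # sqrt(sigma2 / (D' M D))
    return sigma2, se


# Usage with hand-made residualized data and one selected control.
md = [1.0, -1.0, 1.0, -1.0]
my = [2.5, -1.5, 1.5, -2.5]
sigma2_hat, se_hat = variance_and_se(my, md, alpha=2.0, s_hat=1)
```

An approximate 95% interval would then be `alpha` plus or minus 1.96 times `se_hat`, in line with the normal limit in Step 1.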